
@jyuu
Last active August 12, 2019 20:35
Review butcher
---
title: "Model Butcher"
output: html_notebook
---
# butcher
A new package that helps reduce the size of modeling objects saved
to disk.
```{r install}
# install.packages("butcher")
library(butcher)
library(parsnip)
```
## Our approach
We fitted the models available from `parsnip` and identified which
model components consumed the most memory. Example for `lm`:
```{r weigh}
fit_lm <- function() {
  random_stuff <- runif(1e6) # large object we forgot about
  lm(mpg ~ cyl, data = mtcars)
}
lm_ex <- fit_lm()
butcher::weigh(lm_ex)
```
Why does `terms` have such a large memory footprint?
```{r why-terms}
library(rlang)
env_fit_lm <- attr(lm_ex$terms, ".Environment")
env_fit_lm
env_print(env_fit_lm)
```
Repeating this process across models, we found the common culprits were:

1. Call
2. Training data
3. Environment

Additional clutter, but with a small memory footprint:

4. Fitted values
5. Control parameters

So we created five S3 generics:
### `butcher::axe_call()`
When the model is evaluated with `do.call()`, the evaluated arguments
(including the data) get inlined into the stored call:
```{r call}
# Without `do.call`
without_do_call <- lm(mpg ~ cyl, data = mtcars)
without_do_call
# With `do.call`
.f <- mpg ~ cyl
.data <- mtcars
with_do_call <- do.call(lm, list(formula = .f, data = .data))
with_do_call
# Remove the call object
cleaned_fit <- butcher::axe_call(with_do_call, verbose = TRUE)
# How axing works, replace with prototype
cleaned_fit$call
# In addition... add butchered class
class(cleaned_fit)
# And attach new butcher attribute
attr(cleaned_fit, "butcher_disabled")
# Should we use the adjective disabled?
print(cleaned_fit)
summary(cleaned_fit)
# Compare to output from using `do.call()`
summary(with_do_call)
```
### `butcher::axe_ctrl()`
Control parameters are not needed if you do not intend to update
the model.
```{r control}
library(C50)
fit_c5 <- decision_tree("classification") %>%
  set_engine("C5.0") %>%
  fit(Species ~ ., data = iris)
# Is there a threshold in memory released that makes something worth axing?
c5_ex <- butcher::axe_ctrl(fit_c5, verbose = TRUE)
```
### `butcher::axe_data()`
Training data is often carried over into the final model output.
```{r data}
library(sparklyr)
sc <- spark_connect(master = "local")
iris_tbls <- sdf_copy_to(sc, iris, overwrite = TRUE) %>%
  sdf_random_split(train = 2/3, validation = 1/3, seed = 2019)
train <- iris_tbls$train
fit_spark <- ml_decision_tree_classifier(train, Species ~ .)
# See components of this fitted Spark model
names(fit_spark)
fit_spark$dataset
# sparklyr's `ml_save()` function is smart already
ml_save(fit_spark, path = "fitted_spark_obj", overwrite = TRUE)
fit_spark_loaded <- ml_load(sc, "fitted_spark_obj")
names(fit_spark_loaded)
names(fit_spark$pipeline_model)
# But in the case that the user serializes with `saveRDS()`
saveRDS(fit_spark, file = "another_fitted_spark_obj.rds")
fit_spark_loaded <- readRDS("another_fitted_spark_obj.rds")
names(fit_spark_loaded)
# ...butcher's axe methods are useful
spark_ex <- butcher::axe_data(fit_spark, verbose = TRUE)
```
### `butcher::axe_env()`
Users may accidentally carry over environments through the formulas
and quosures created during model development.
```{r env}
library(recipes)
library(lobstr)
create_recipes_object <- function() {
  some_junk_in_env <- runif(1e6) # we don't need
  print(obj_size(some_junk_in_env))
  recipe(mpg ~ cyl, data = mtcars) %>%
    step_center(all_predictors()) %>%
    step_scale(all_predictors()) %>%
    step_spatialsign(all_predictors())
}
recipes_ex <- create_recipes_object()
# Each step has its own quosure, but the same environment is attached
cleaned_recipes_ex <- butcher::axe_env(recipes_ex, verbose = TRUE)
```
### `butcher::axe_fitted()`
Fitted values are generally not needed as they can be derived
on-the-fly.
```{r fitted}
library(xgboost)
# Load data
data(agaricus.train)
# Fit xgboost model
fit_xgb <- xgboost(
  data = agaricus.train$data,
  label = agaricus.train$label,
  max_depth = 2,
  eta = 1,
  nthread = 2,
  nrounds = 2,
  objective = "binary:logistic"
)
xgb_ex <- butcher::axe_fitted(fit_xgb, verbose = TRUE)
```
### `butcher::butcher()`
All five of these generics can be executed at once with
`butcher::butcher()`, without breaking the object's `predict()`
method.
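A minimal sketch of that one-stop call, refitting the small `lm` example from earlier (the `verbose` flag and the surviving `predict()` method are as described above):
```{r butcher-all}
# Refit the small lm example, then run all axe methods at once
fit <- lm(mpg ~ cyl, data = mtcars)
small_fit <- butcher::butcher(fit, verbose = TRUE)
# Prediction still works on the butchered object
predict(small_fit, newdata = head(mtcars))
```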
## Adding new axe methods
My approach to fitting models here is contrived. We want to make it
as painless as possible for the community to extend the existing set
of methods by running:

`butcher::new_model_butcher("new_model_class", "model_pkg")`

Users can then run `butcher::weigh()` and `butcher::locate()` to
figure out how to customize the axe methods for the new model object.
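As a hedged sketch of that workflow on the earlier `lm` fit (assuming `locate()` accepts a `name` argument to pick out a single component):
```{r weigh-locate}
fit <- lm(mpg ~ cyl, data = mtcars)
# Rank the components of the fitted object by memory footprint
butcher::weigh(fit)
# Report where a named component lives inside the object
butcher::locate(fit, name = "terms")
# Scaffold axe methods for a new class (not run; writes template files):
# butcher::new_model_butcher("new_model_class", "model_pkg")
```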