---
title: "Model Butcher"
output: html_notebook
---
# butcher

`butcher` is a new package that helps reduce the size of modeling
objects saved to disk.
```{r install}
# install.packages("butcher")
library(butcher)
library(parsnip)
```
## Our approach

We fitted the models available from `parsnip` and identified which
model components consumed the most memory. An example for `lm`:
```{r weigh}
fit_lm <- function() {
  random_stuff <- runif(1e6) # a large object we forgot about
  return(
    lm(mpg ~ cyl, data = mtcars)
  )
}
lm_ex <- fit_lm()
butcher::weigh(lm_ex)
```
Why does `terms` have such a large memory footprint?

```{r why-terms}
library(rlang)
env_fit_lm <- attr(lm_ex$terms, ".Environment")
env_fit_lm
env_print(env_fit_lm)
```
Repeating this process across models, we found the common culprits were:

1. Call
2. Training data
3. Environment

Additional clutter, but with a small memory footprint:

4. Fitted values
5. Control parameters

To target these, we created five S3 generics.
### `butcher::axe_call()`

The call object can grow large when a model is evaluated with `do.call()`:

```{r call}
# Without `do.call()`
without_do_call <- lm(mpg ~ cyl, data = mtcars)
without_do_call

# With `do.call()`
.f <- mpg ~ cyl
.data <- mtcars
with_do_call <- do.call(lm, list(formula = .f, data = .data))
with_do_call

# Remove the call object
cleaned_fit <- butcher::axe_call(with_do_call, verbose = TRUE)

# How axing works: the component is replaced with a prototype
cleaned_fit$call

# In addition, a butchered class is added...
class(cleaned_fit)

# ...and a new butcher attribute is attached
attr(cleaned_fit, "butcher_disabled")

# Should we use the adjective "disabled"?
print(cleaned_fit)
summary(cleaned_fit)

# Compare to the output from using `do.call()`
summary(with_do_call)
```
### `butcher::axe_ctrl()`

Control parameters are not needed if you do not intend to update
the model.

```{r control}
library(C50)
fit_c5 <- decision_tree("classification") %>%
  set_engine("C5.0") %>%
  fit(Species ~ ., data = iris)

# Is there a threshold in memory released that makes something worth axing?
c5_ex <- butcher::axe_ctrl(fit_c5, verbose = TRUE)
```
### `butcher::axe_data()`

Training data is often carried over into the final model output.

```{r data}
library(sparklyr)
sc <- spark_connect(master = "local")
iris_tbls <- sdf_copy_to(sc, iris, overwrite = TRUE) %>%
  sdf_random_split(train = 2/3, validation = 1/3, seed = 2019)
train <- iris_tbls$train
fit_spark <- ml_decision_tree_classifier(train, Species ~ .)

# See the components of this fitted Spark model
names(fit_spark)
fit_spark$dataset

# sparklyr's `ml_save()` function is smart already
ml_save(fit_spark, path = "fitted_spark_obj", overwrite = TRUE)
fit_spark_loaded <- ml_load(sc, "fitted_spark_obj")
names(fit_spark_loaded)
names(fit_spark$pipeline_model)

# But in the case that the user serializes with `saveRDS()`...
saveRDS(fit_spark, file = "another_fitted_spark_obj.rds")
fit_spark_loaded <- readRDS("another_fitted_spark_obj.rds")
names(fit_spark_loaded)

# ...butcher's axe methods are useful
spark_ex <- butcher::axe_data(fit_spark, verbose = TRUE)
```
### `butcher::axe_env()`

Users may accidentally carry over environments through the use of
formulas and quosures during model development.

```{r env}
library(recipes)
library(lobstr)
create_recipes_object <- function() {
  some_junk_in_env <- runif(1e6) # an object we don't need
  print(obj_size(some_junk_in_env))
  return(
    recipe(mpg ~ cyl, data = mtcars) %>%
      step_center(all_predictors()) %>%
      step_scale(all_predictors()) %>%
      step_spatialsign(all_predictors())
  )
}
recipes_ex <- create_recipes_object()

# Each step has its own quosure, but the same environment is attached
cleaned_recipes_ex <- butcher::axe_env(recipes_ex, verbose = TRUE)
```
### `butcher::axe_fitted()`

Fitted values are generally not needed, as they can be derived
on the fly.

```{r fitted}
library(xgboost)

# Load data
data(agaricus.train)

# Fit an xgboost model
fit_xgb <- xgboost(data = agaricus.train$data,
                   label = agaricus.train$label,
                   max_depth = 2,
                   eta = 1,
                   nthread = 2,
                   nrounds = 2,
                   objective = "binary:logistic")
xgb_ex <- butcher::axe_fitted(fit_xgb, verbose = TRUE)
```
### `butcher::butcher()`

All five of these generics can be executed at once with
`butcher::butcher()`, without breaking the object's `predict()`
method.
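As a minimal sketch (assuming only base R plus `butcher` are available; the contrived `fit_lm()` helper below mirrors the earlier example):

```{r butcher-all}
library(butcher)

fit_lm <- function() {
  junk <- runif(1e6) # large object captured by the formula's environment
  lm(mpg ~ cyl, data = mtcars)
}

lm_full <- fit_lm()

# Run all five axe generics in one call
lm_small <- butcher::butcher(lm_full, verbose = TRUE)

# predict() should still work on the butchered object
head(predict(lm_small, newdata = mtcars))
```

Comparing `lobstr::obj_size()` before and after is a quick way to confirm how much was released.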
## Adding new axe methods

The approach to fitting models shown above is contrived. We want to
make it as painless as possible for the community to extend the
existing set of methods by running:

`butcher::new_model_butcher("new_model_class", "model_pkg")`

Users can then apply `butcher::weigh()` and `butcher::locate()` to
figure out how to customize the axe methods for a new model object.
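A hedged sketch of that workflow, using an `lm` fit as a stand-in for the new model class (the scaffolding call is shown commented out because its class and package arguments are placeholders, and assuming `locate()` accepts a `name` argument):

```{r new-axe}
library(butcher)

# Scaffold axe method stubs for a new model class inside the package
# you are developing (not run here; the arguments are placeholders):
# butcher::new_model_butcher("new_model_class", "model_pkg")

# To decide what each stub should drop, inspect a fitted object:
fit <- lm(mpg ~ cyl, data = mtcars)
butcher::weigh(fit)                   # ranks model components by memory footprint
butcher::locate(fit, name = "terms")  # shows where a named component lives
```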