---
title: "Model Butcher"
output: html_notebook
---
# butcher

`butcher` is a new package that helps reduce the size of modeling
objects saved to disk.
```{r install}
# install.packages("butcher")
library(butcher)
library(parsnip)
```
## Our approach

We fitted the models available from `parsnip` and identified which
model components consumed the most memory. An example for `lm`:
```{r weigh}
fit_lm <- function() {
  random_stuff <- runif(1e6) # a large object we forgot about
  return(
    lm(mpg ~ cyl, data = mtcars)
  )
}
lm_ex <- fit_lm()
butcher::weigh(lm_ex)
```
Why does `terms` have such a large memory footprint?

```{r why-terms}
library(rlang)
env_fit_lm <- attr(lm_ex$terms, ".Environment")
env_fit_lm
env_print(env_fit_lm)
```
Repeating this process across models, we found the common culprits were:

1. Call
2. Training data
3. Environment

Additional clutter, but with a small memory footprint:

4. Fitted values
5. Control parameters

To target these, we created five S3 generics.
### `butcher::axe_call()`

The call object can grow large when a model is evaluated with `do.call()`:

```{r call}
# Without `do.call()`
without_do_call <- lm(mpg ~ cyl, data = mtcars)
without_do_call

# With `do.call()`
.f <- mpg ~ cyl
.data <- mtcars
with_do_call <- do.call(lm, list(formula = .f, data = .data))
with_do_call

# Remove the call object
cleaned_fit <- butcher::axe_call(with_do_call, verbose = TRUE)

# How axing works: the component is replaced with a prototype
cleaned_fit$call

# In addition, a butchered class is added...
class(cleaned_fit)

# ...and a new butcher attribute is attached
attr(cleaned_fit, "butcher_disabled")

# Should we use the adjective "disabled"?
print(cleaned_fit)
summary(cleaned_fit)

# Compare to the output from using `do.call()`
summary(with_do_call)
```
### `butcher::axe_ctrl()`

Control parameters are not needed if you do not intend to update
the model.

```{r control}
library(C50)
fit_c5 <- decision_tree("classification") %>%
  set_engine("C5.0") %>%
  fit(Species ~ ., data = iris)

# Is there a threshold in memory released that makes something worth axing?
c5_ex <- butcher::axe_ctrl(fit_c5, verbose = TRUE)
```
### `butcher::axe_data()`

Training data is often carried over into the final model output.

```{r data}
library(sparklyr)
sc <- spark_connect(master = "local")
iris_tbls <- sdf_copy_to(sc, iris, overwrite = TRUE) %>%
  sdf_random_split(train = 2/3, validation = 1/3, seed = 2019)
train <- iris_tbls$train
fit_spark <- ml_decision_tree_classifier(train, Species ~ .)

# See the components of this fitted Spark model
names(fit_spark)
fit_spark$dataset

# sparklyr's `ml_save()` function is smart already
ml_save(fit_spark, path = "fitted_spark_obj", overwrite = TRUE)
fit_spark_loaded <- ml_load(sc, "fitted_spark_obj")
names(fit_spark_loaded)
names(fit_spark$pipeline_model)

# But in the case that the user serializes with `saveRDS()`...
saveRDS(fit_spark, file = "another_fitted_spark_obj.rds")
fit_spark_loaded <- readRDS("another_fitted_spark_obj.rds")
names(fit_spark_loaded)

# ...butcher's axe methods are useful
spark_ex <- butcher::axe_data(fit_spark, verbose = TRUE)
```
### `butcher::axe_env()`

Users may accidentally carry over environments through the use of
formulas and quosures during model development.

```{r env}
library(recipes)
library(lobstr)
create_recipes_object <- function() {
  some_junk_in_env <- runif(1e6) # an object we don't need
  print(obj_size(some_junk_in_env))
  return(
    recipe(mpg ~ cyl, data = mtcars) %>%
      step_center(all_predictors()) %>%
      step_scale(all_predictors()) %>%
      step_spatialsign(all_predictors())
  )
}
recipes_ex <- create_recipes_object()

# Each step has its own quosure, but the same environment is attached
cleaned_recipes_ex <- butcher::axe_env(recipes_ex, verbose = TRUE)
```
### `butcher::axe_fitted()`

Fitted values are generally not needed, as they can be derived
on the fly.

```{r fitted}
library(xgboost)

# Load data
data(agaricus.train)

# Fit an xgboost model
fit_xgb <- xgboost(data = agaricus.train$data,
                   label = agaricus.train$label,
                   max_depth = 2,
                   eta = 1,
                   nthread = 2,
                   nrounds = 2,
                   objective = "binary:logistic")
xgb_ex <- butcher::axe_fitted(fit_xgb, verbose = TRUE)
```
### `butcher::butcher()`

All five of these generics can be executed at once with
`butcher::butcher()`, without breaking the object's `predict()`
method.
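As a minimal sketch (assuming only base R plus `butcher` are available; the contrived `fit_lm()` helper below mirrors the earlier example):

```{r butcher-all}
library(butcher)

fit_lm <- function() {
  junk <- runif(1e6) # large object captured by the formula's environment
  lm(mpg ~ cyl, data = mtcars)
}

lm_full <- fit_lm()

# Run all five axe generics in one call
lm_small <- butcher::butcher(lm_full, verbose = TRUE)

# predict() should still work on the butchered object
head(predict(lm_small, newdata = mtcars))
```

Comparing `lobstr::obj_size()` before and after is a quick way to confirm how much was released.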
## Adding new axe methods

The approach to fitting models shown above is contrived. We want to
make it as painless as possible for the community to extend the
existing set of methods by running:

`butcher::new_model_butcher("new_model_class", "model_pkg")`

Users can then apply `butcher::weigh()` and `butcher::locate()` to
figure out how to customize the axe methods for a new model object.
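A hedged sketch of that workflow, using an `lm` fit as a stand-in for the new model class (the scaffolding call is shown commented out because its class and package arguments are placeholders, and assuming `locate()` accepts a `name` argument):

```{r new-axe}
library(butcher)

# Scaffold axe method stubs for a new model class inside the package
# you are developing (not run here; the arguments are placeholders):
# butcher::new_model_butcher("new_model_class", "model_pkg")

# To decide what each stub should drop, inspect a fitted object:
fit <- lm(mpg ~ cyl, data = mtcars)
butcher::weigh(fit)                   # ranks model components by memory footprint
butcher::locate(fit, name = "terms")  # shows where a named component lives
```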