---
title: "Data masking"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Data masking}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  comment = "#",
  collapse = TRUE,
  eval = TRUE,
  echo = TRUE,
  warning = FALSE,
  message = FALSE
)
```

This vignette demonstrates how to use data masking to dynamically specify `learner_args`, `scorer_args`, `prediction_args`, and `splitter_args` so that each is evaluated on the appropriate subset of data. The modeltuning package provides two special verbs for use inside these argument lists: `.data` and `.index`.

- `.data`: Accesses the data available at the time the model is fit. For example, if the training data includes a column `w` of observation-level weights and you're performing cross-validation, using `.data$w` within `learner_args` ensures that the correct subset of weights is used for each fold.
- `.index`: Accesses the indices of the current subset of data. In the same example, subsetting the full-length weights vector with `.index` inside `learner_args` (written as `mtcars$w[.index]` in the examples below) also guarantees that only the relevant subset of weights is used for each fold.

The following sections provide worked examples illustrating the use of `.data` and `.index` in the `CV`, `GridSearch`, and `GridSearchCV` classes.

# CV Example

Below we show how to supply observation-level weights in cross-validation using both `.data` and `.index`. We'll use the `mtcars` dataset and create a new column `w` of random weights.

```{r}
library(rsample)
library(yardstick)
library(modeltuning)

# Add a column of random, non-negative observation-level weights
mtcars$w <- abs(rnorm(nrow(mtcars)))

# Return the in-sample row indices for each fold
splitter <- function(data, ...) lapply(vfold_cv(data, ...)$splits, \(.x) .x$in_id)
```

## Using .data

First, we show how to use `.data` to supply the weights via `learner_args`. We can also supply the weights for the out-of-sample predictions via `prediction_args` and `scorer_args`. Finally, note that we can also supply dynamic arguments to the `splitter` function via `splitter_args`.
Here we demonstrate this by stratifying the folds by `cyl`.

```{r}
mtcars_cv <- CV$new(
  learner = glm,
  learner_args = list(weights = .data$w, family = gaussian),
  splitter = splitter,
  splitter_args = list(v = 2, strata = .data$cyl),
  scorer = list("rmse" = yardstick::rmse_vec),
  scorer_args = list("rmse" = list(case_weights = .data$w)),
  prediction_args = list("rmse" = list(weights = .data$w))
)
mtcars_cv_fitted <- mtcars_cv$fit(formula = mpg ~ . - w, data = mtcars)
coef(mtcars_cv_fitted$model)
```

The coefficients of the fully fitted model are identical to those obtained by calling `glm` directly on the full dataset with the same weights.

```{r}
coef(glm(mpg ~ . - w, data = mtcars, weights = mtcars$w))
```

## Using .index

Next, we demonstrate that we can achieve the same result using `.index` instead of `.data`. Rather than accessing the underlying data with `.data`, we subset the raw weights vector with the indices of the current subset using `.index`.

```{r}
mtcars_cv <- CV$new(
  learner = glm,
  learner_args = list(weights = mtcars$w[.index], family = gaussian),
  splitter = splitter,
  splitter_args = list(v = 2, strata = cyl),
  scorer = list("rmse" = yardstick::rmse_vec),
  scorer_args = list("rmse" = list(case_weights = mtcars$w[.index])),
  prediction_args = list("rmse" = list(weights = mtcars$w[.index]))
)
mtcars_cv_fitted <- mtcars_cv$fit(formula = mpg ~ . - w, data = mtcars)
coef(mtcars_cv_fitted$model)
```

Again, the coefficients of the fully fitted model are identical to those obtained by calling `glm` directly on the full dataset with the same weights.

```{r}
coef(glm(mpg ~ . - w, data = mtcars, weights = mtcars$w))
```

# GridSearch Example

Data masking is less essential for grid search, since the training and evaluation datasets are fixed. We can still use the `.data` verb to dynamically access attributes of the in-sample and evaluation datasets. However, since there is no sub-sampling, the `.index` verb is not defined and using it will result in an error. We demonstrate both below.
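
One way to picture why `.index` is unavailable when nothing is sub-sampled is the base R sketch below. It is an illustration under stated assumptions, not modeltuning's implementation: the data mask is modelled as a plain list, and the `toy` data frame and `mask` object are hypothetical names introduced only for this example.

```{r}
# Illustration only: a base R stand-in for a data mask, not modeltuning code.
toy <- data.frame(y = c(3, 1, 4, 1, 5), w = c(0.5, 1, 1.5, 2, 2.5))

# With fixed datasets there are no fold indices, so this mask binds `.data`
# but has no `.index` entry.
mask <- list(.data = toy)

# `.data$w` resolves against the mask.
full_w <- eval(quote(.data$w), mask)

# `.index` is bound neither in the mask nor in the calling environment,
# so evaluation fails; we capture the error message instead of stopping.
no_index <- tryCatch(
  eval(quote(toy$w[.index]), mask),
  error = function(e) conditionMessage(e)
)

full_w
no_index
```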

## Using .data

```{r}
mtcars_train <- mtcars[1:25, ]
mtcars_eval <- mtcars[26:nrow(mtcars), ]

mtcars_gs <- GridSearch$new(
  learner = glm,
  tune_params = list(na.action = c(na.omit, na.fail)),
  learner_args = list(weights = .data$w, family = gaussian),
  evaluation_data = list(x = mtcars_eval, y = mtcars_eval$mpg),
  scorer = list("rmse" = yardstick::rmse_vec),
  scorer_args = list("rmse" = list(case_weights = .data$w)),
  prediction_args = list("rmse" = list(weights = .data$w))
)
mtcars_gs_fitted <- mtcars_gs$fit(formula = mpg ~ . - w, data = mtcars_train)
mtcars_gs_fitted$best_params
```

## Will ERROR when .index is used

```{r, error=TRUE}
mtcars_gs <- GridSearch$new(
  learner = glm,
  tune_params = list(na.action = c(na.omit, na.fail)),
  learner_args = list(weights = mtcars$w[.index], family = gaussian),
  evaluation_data = list(x = mtcars_eval[, -1], y = mtcars_eval[, 1]),
  scorer = list("rmse" = yardstick::rmse_vec),
  scorer_args = list("rmse" = list(case_weights = mtcars$w[.index])),
  prediction_args = list("rmse" = list(weights = mtcars$w[.index]))
)
mtcars_gs_fitted <- mtcars_gs$fit(formula = mpg ~ . - w, data = mtcars_train)
```

## Raw weight vectors

Because the training and evaluation datasets are fixed, we can also supply the raw weight vectors directly, taking the training weights from `mtcars_train` and the evaluation weights from `mtcars_eval`.

```{r}
mtcars_gs <- GridSearch$new(
  learner = glm,
  tune_params = list(na.action = c(na.omit, na.fail)),
  learner_args = list(weights = mtcars_train$w, family = gaussian),
  evaluation_data = list(x = mtcars_eval[, -1], y = mtcars_eval[, 1]),
  scorer = list("rmse" = yardstick::rmse_vec),
  scorer_args = list("rmse" = list(case_weights = mtcars_eval$w)),
  prediction_args = list("rmse" = list(weights = mtcars_eval$w))
)
mtcars_gs_fitted <- mtcars_gs$fit(formula = mpg ~ . - w, data = mtcars_train)
coef(mtcars_gs_fitted$best_model)
```

# GridSearchCV Example

Finally, we combine both of these ideas in `GridSearchCV`, which performs grid search with cross-validation for error estimation.
Here, as with `CV`, both `.data` and `.index` can be used to dynamically access the correct subsets of data.

## Using .data

```{r}
mtcars_gs_cv <- GridSearchCV$new(
  learner = glm,
  tune_params = list(na.action = c(na.omit, na.fail)),
  learner_args = list(weights = .data$w, family = gaussian),
  splitter = splitter,
  splitter_args = list(v = 2, strata = cyl),
  scorer = list("rmse" = yardstick::rmse_vec),
  scorer_args = list("rmse" = list(case_weights = .data$w)),
  prediction_args = list("rmse" = list(weights = .data$w))
)
mtcars_gs_cv_fitted <- mtcars_gs_cv$fit(formula = mpg ~ . - w, data = mtcars)
coef(mtcars_gs_cv_fitted$best_model)
```

## Using .index

```{r}
mtcars_gs_cv <- GridSearchCV$new(
  learner = glm,
  tune_params = list(na.action = c(na.omit, na.fail)),
  learner_args = list(weights = mtcars$w[.index], family = gaussian),
  splitter = splitter,
  splitter_args = list(v = 2, strata = cyl),
  scorer = list("rmse" = yardstick::rmse_vec),
  scorer_args = list("rmse" = list(case_weights = mtcars$w[.index])),
  prediction_args = list("rmse" = list(weights = mtcars$w[.index]))
)
mtcars_gs_cv_fitted <- mtcars_gs_cv$fit(formula = mpg ~ . - w, data = mtcars)
coef(mtcars_gs_cv_fitted$best_model)
```
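
To see why the `.data` and `.index` spellings select the same weights within each fold, the following base R sketch may help. It is a simplified illustration, not modeltuning's actual implementation: `eval_in_fold()`, `demo`, and `folds` are hypothetical names, and the sketch assumes that `.data` is bound to the rows of the current fold while `.index` holds those rows' positions in the full dataset.

```{r}
# Hypothetical helper, for illustration only (not part of modeltuning):
# evaluate `expr` with `.data` bound to the fold's rows and `.index` bound
# to the fold's row positions in the full data.
eval_in_fold <- function(expr, data, index, env = parent.frame()) {
  force(env)
  mask <- list(.data = data[index, , drop = FALSE], .index = index)
  eval(expr, mask, env)
}

demo <- data.frame(y = 1:6, w = c(0.5, 1, 1.5, 2, 2.5, 3))
folds <- list(c(1L, 3L, 5L), c(2L, 4L, 6L))

# For each fold, compare the weights selected by the two spellings.
same <- vapply(folds, function(idx) {
  identical(
    eval_in_fold(quote(.data$w), demo, idx),
    eval_in_fold(quote(demo$w[.index]), demo, idx)
  )
}, logical(1))

same
```

Under these assumptions the two expressions differ only in where the subsetting happens: `.data$w` reads a column that has already been subset, whereas `demo$w[.index]` subsets the full-length vector at evaluation time.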