---
title: "Data masking"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Data masking}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
```{r setup, include = FALSE}
knitr::opts_chunk$set(
  comment = "#",
  collapse = TRUE,
  eval = TRUE,
  echo = TRUE,
  warning = FALSE,
  message = FALSE
)
```

This vignette demonstrates how to use data masking to dynamically specify `learner_args`, `scorer_args`, `prediction_args`, and `splitter_args` so that each is evaluated on the appropriate subset of data.
The modeltuning package provides two special verbs for use inside these argument lists: `.data` and `.index`.

- `.data`: Accesses the data available at the time the model is fit. For example, if the training data includes a column `w` of observation-level weights and you’re performing cross-validation, using `.data$w` within `learner_args` ensures that the correct subset of weights is used for each fold.
- `.index`: Accesses the indices of the current subset of data. In the same example, subsetting the full-length weight vector with `.index` inside `learner_args`, as in `mtcars$w[.index]`, also guarantees that only the relevant subset of weights is used for each fold.

The following sections provide worked examples illustrating the use of `.data` and `.index` in the `CV`, `GridSearch`, and `GridSearchCV` classes.
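
One way to picture what happens behind these verbs is the following base-R sketch. It is not modeltuning's implementation: `eval_masked_args()` and the `toy` data frame are hypothetical names introduced here, and the sketch simply assumes that argument expressions are stored unevaluated and later evaluated with `.data` and `.index` bound to the current subset. Both expressions resolve to the weights of rows 2, 4, and 5.

```r
# Hypothetical helper (not part of modeltuning): evaluate captured argument
# expressions with `.data` and `.index` bound to one subset of the data.
eval_masked_args <- function(arg_exprs, data, index, env = parent.frame()) {
  force(env)  # resolve the caller's environment before entering lapply()
  mask <- list(.data = data[index, , drop = FALSE], .index = index)
  lapply(arg_exprs, function(expr) eval(expr, envir = mask, enclos = env))
}

# Toy data with a weight column `w`
toy <- data.frame(y = c(1.2, 0.7, 3.1, 2.4, 1.9, 2.8), w = c(1, 2, 3, 4, 5, 6))

# alist() keeps the arguments unevaluated until a subset is available
arg_exprs <- alist(via_data = .data$w, via_index = toy$w[.index])

# Pretend rows 2, 4 and 5 form the current in-sample fold
fold_rows <- c(2, 4, 5)
fold_args <- eval_masked_args(arg_exprs, toy, fold_rows)

fold_args$via_data
fold_args$via_index
```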
# CV Example
Below we show how to supply observation-level weights in cross-validation using both
`.data` and `.index`. We'll use the `mtcars` dataset and create a new column `w` of random weights.
```{r}
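library(rsample)
library(yardstick)
library(modeltuning)

# Fix the seed so the random weights and folds are reproducible
set.seed(123)
mtcars$w <- abs(rnorm(nrow(mtcars)))

# Return the in-sample row indices of each fold produced by `vfold_cv()`
splitter <- function(data, ...) lapply(vfold_cv(data, ...)$splits, \(.x) .x$in_id)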
```
## Using .data
First, we show how to use `.data` to supply the weights via `learner_args`. We can
also supply the weights for the out-of-sample predictions via `prediction_args` and
`scorer_args`. Finally, note that we can also supply dynamic arguments to the `splitter`
function via `splitter_args`. Here we demonstrate by stratifying the folds by `cyl`.
```{r}
mtcars_cv <- CV$new(
  learner = glm,
  learner_args = list(weights = .data$w, family = gaussian),
  splitter = splitter,
  splitter_args = list(v = 2, strata = .data$cyl),
  scorer = list("rmse" = yardstick::rmse_vec),
  scorer_args = list("rmse" = list(case_weights = .data$w)),
  prediction_args = list("rmse" = list(weights = .data$w))
)
mtcars_cv_fitted <- mtcars_cv$fit(formula = mpg ~ . - w, data = mtcars)
coef(mtcars_cv_fitted$model)
```
The fully fitted model has the same coefficients as a direct call to `glm` with the
full dataset and weights.
```{r}
coef(glm(mpg ~ . - w, data = mtcars, weights = mtcars$w))
```
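
To make the role of per-fold weights in fitting and scoring concrete, the following base-R sketch fits on one in-sample subset and scores the held-out rows by hand. It assumes that a weighted RMSE is the square root of the weighted mean of squared errors. The helper `weighted_rmse()`, the copy `cars`, the simplified formula, and the even/odd split are illustrative choices, not part of modeltuning or yardstick.

```r
# Illustrative helper: square root of the weighted mean of squared errors
weighted_rmse <- function(truth, estimate, w) {
  sqrt(sum(w * (truth - estimate)^2) / sum(w))
}

set.seed(1)
cars <- datasets::mtcars
cars$w <- abs(rnorm(nrow(cars)))

# Hypothetical single split: even rows in-sample, odd rows held out
in_id <- seq(2, nrow(cars), by = 2)
train <- cars[in_id, ]
test <- cars[-in_id, ]

# Fit on the in-sample rows using only their weights
fit <- glm(mpg ~ wt + hp, data = train, weights = w, family = gaussian)

# Score the held-out rows using only their weights
preds <- predict(fit, newdata = test)
fold_rmse <- weighted_rmse(test$mpg, preds, test$w)
fold_rmse
```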
## Using .index
Next, we demonstrate that we can achieve the same result using `.index` instead of `.data`.
Rather than accessing the underlying data with `.data`, we subset the raw weights
vector with the indices of the current subset using `.index`.
```{r}
mtcars_cv <- CV$new(
  learner = glm,
  learner_args = list(weights = mtcars$w[.index], family = gaussian),
  splitter = splitter,
  splitter_args = list(v = 2, strata = cyl),
  scorer = list("rmse" = yardstick::rmse_vec),
  scorer_args = list("rmse" = list(case_weights = mtcars$w[.index])),
  prediction_args = list("rmse" = list(weights = mtcars$w[.index]))
)
mtcars_cv_fitted <- mtcars_cv$fit(formula = mpg ~ . - w, data = mtcars)
coef(mtcars_cv_fitted$model)
```
Again, the fully fitted model has the same coefficients as a direct call to `glm` with the
full dataset and weights.
```{r}
coef(glm(mpg ~ . - w, data = mtcars, weights = mtcars$w))
```
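
The agreement between the two approaches can also be checked by hand. The sketch below is a base-R illustration under stated assumptions: the 2-fold assignment is built with `sample()` rather than rsample, `cars` is a local copy of `mtcars`, and the formula is simplified. For each fold it fits one model with weights taken from the fold's own data (the `.data` route) and one with weights taken from the full vector by row index (the `.index` route), then compares coefficients.

```r
set.seed(42)
cars <- datasets::mtcars
cars$w <- abs(rnorm(nrow(cars)))

# Hypothetical 2-fold assignment built with base R instead of rsample
fold_id <- sample(rep(1:2, length.out = nrow(cars)))
in_ids <- lapply(1:2, function(k) which(fold_id != k))

fits_match <- vapply(in_ids, function(index) {
  fold_data <- cars[index, ]  # what `.data` would refer to for this fold
  # Weights read from the fold's own data
  fit_data <- glm(mpg ~ wt + hp, data = fold_data, weights = fold_data$w)
  # Weights read from the full vector, subset by the fold's row indices
  fit_index <- glm(mpg ~ wt + hp, data = fold_data, weights = cars$w[index])
  isTRUE(all.equal(coef(fit_data), coef(fit_index)))
}, logical(1))

fits_match
```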
# GridSearch Example
Data masking is less essential for grid search, since the training and evaluation datasets are fixed,
but we can still use the `.data` verb to dynamically access attributes of the in-sample and evaluation
datasets. However, since there is no sub-sampling, the `.index` verb is not defined and using it results
in an error. We demonstrate both below.
## Using .data
```{r}
mtcars_train <- mtcars[1:25, ]
mtcars_eval <- mtcars[26:nrow(mtcars), ]
mtcars_gs <- GridSearch$new(
  learner = glm,
  tune_params = list(na.action = c(na.omit, na.fail)),
  learner_args = list(weights = .data$w, family = gaussian),
  evaluation_data = list(x = mtcars_eval, y = mtcars_eval$mpg),
  scorer = list("rmse" = yardstick::rmse_vec),
  scorer_args = list("rmse" = list(case_weights = .data$w)),
  prediction_args = list("rmse" = list(weights = .data$w))
)
mtcars_gs_fitted <- mtcars_gs$fit(formula = mpg ~ . - w, data = mtcars_train)
mtcars_gs_fitted$best_params
```
## Using .index (raises an error)
```{r, error=TRUE}
mtcars_gs <- GridSearch$new(
  learner = glm,
  tune_params = list(na.action = c(na.omit, na.fail)),
  learner_args = list(weights = mtcars$w[.index], family = gaussian),
  evaluation_data = list(x = mtcars_eval[, -1], y = mtcars_eval[, 1]),
  scorer = list("rmse" = yardstick::rmse_vec),
  scorer_args = list("rmse" = list(case_weights = mtcars$w[.index])),
  prediction_args = list("rmse" = list(weights = mtcars$w[.index]))
)
mtcars_gs_fitted <- mtcars_gs$fit(formula = mpg ~ . - w, data = mtcars_train)
```
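
The failure follows from how name lookup works once no `.index` binding exists. The base-R sketch below illustrates the idea with a hypothetical mask that binds only `.data`, as would make sense for a fixed train/evaluation split. It is an assumption about the general mechanism, and the exact error message raised by modeltuning may differ.

```r
# Hypothetical mask for a fixed split: only `.data` is bound
toy <- data.frame(y = c(1.2, 0.7, 3.1), w = c(0.5, 1.0, 1.5))
mask <- list(.data = toy)

# `.data$w` resolves against the mask
weights_ok <- eval(quote(.data$w), envir = mask, enclos = baseenv())
weights_ok

# An expression that needs `.index` fails because nothing binds that name
msg <- tryCatch(
  eval(quote(.data$w[.index]), envir = mask, enclos = baseenv()),
  error = function(e) conditionMessage(e)
)
msg
```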
## Raw weight vectors
As discussed above, since the training and evaluation datasets are fixed, we can
also just supply the raw weight vectors directly.
```{r}
mtcars_gs <- GridSearch$new(
  learner = glm,
  tune_params = list(na.action = c(na.omit, na.fail)),
  learner_args = list(weights = mtcars_train$w, family = gaussian),
  evaluation_data = list(x = mtcars_eval[, -1], y = mtcars_eval[, 1]),
  scorer = list("rmse" = yardstick::rmse_vec),
  scorer_args = list("rmse" = list(case_weights = mtcars_eval$w)),
  prediction_args = list("rmse" = list(weights = mtcars_eval$w))
)
mtcars_gs_fitted <- mtcars_gs$fit(formula = mpg ~ . - w, data = mtcars_train)
coef(mtcars_gs_fitted$best_model)
```
# GridSearchCV Example
Finally, we combine both these ideas in `GridSearchCV`, which performs grid search
with cross-validation for error estimation. Here, as with `CV`, both `.data` and
`.index` can be used to dynamically access the correct subsets of data.
## Using .data
```{r}
mtcars_gs_cv <- GridSearchCV$new(
  learner = glm,
  tune_params = list(na.action = c(na.omit, na.fail)),
  learner_args = list(weights = .data$w, family = gaussian),
  splitter = splitter,
  splitter_args = list(v = 2, strata = cyl),
  scorer = list("rmse" = yardstick::rmse_vec),
  scorer_args = list("rmse" = list(case_weights = .data$w)),
  prediction_args = list("rmse" = list(weights = .data$w))
)
mtcars_gs_cv_fitted <- mtcars_gs_cv$fit(formula = mpg ~ . - w, data = mtcars)
coef(mtcars_gs_cv_fitted$best_model)
```
## Using .index
```{r}
mtcars_gs_cv <- GridSearchCV$new(
  learner = glm,
  tune_params = list(na.action = c(na.omit, na.fail)),
  learner_args = list(weights = mtcars$w[.index], family = gaussian),
  splitter = splitter,
  splitter_args = list(v = 2, strata = cyl),
  scorer = list("rmse" = yardstick::rmse_vec),
  scorer_args = list("rmse" = list(case_weights = mtcars$w[.index])),
  prediction_args = list("rmse" = list(weights = mtcars$w[.index]))
)
mtcars_gs_cv_fitted <- mtcars_gs_cv$fit(formula = mpg ~ . - w, data = mtcars)
coef(mtcars_gs_cv_fitted$best_model)
```
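
As a closing illustration, the sketch below writes out by hand the kind of loop that grid search with cross-validation implies when weights must follow the folds. It is a conceptual sketch only and makes several assumptions that are not taken from modeltuning: candidates are ranked by their mean weighted RMSE across folds, the best candidate is refit on the full data, and the tuning grid (polynomial degree of `wt` in an `lm` fit), the 2-fold assignment, and the copy `cars` are all hypothetical.

```r
set.seed(7)
cars <- datasets::mtcars
cars$w <- abs(rnorm(nrow(cars)))

# Hypothetical 2-fold assignment and tuning grid (polynomial degree of `wt`)
fold_id <- sample(rep(1:2, length.out = nrow(cars)))
degrees <- c(1, 2, 3)

cv_rmse <- vapply(degrees, function(degree) {
  fold_scores <- vapply(1:2, function(k) {
    index <- which(fold_id != k)
    train <- cars[index, ]  # plays the role of `.data` during fitting
    test <- cars[-index, ]  # held-out rows, scored with their own weights
    fit <- lm(mpg ~ poly(wt, degree), data = train, weights = w)
    err <- test$mpg - predict(fit, newdata = test)
    sqrt(sum(test$w * err^2) / sum(test$w))
  }, numeric(1))
  mean(fold_scores)  # average the weighted RMSE over the folds
}, numeric(1))

# Pick the candidate with the lowest cross-validated error
best_degree <- degrees[which.min(cv_rmse)]

# Refit the winning candidate on the full data with the full weight vector
best_fit <- lm(mpg ~ poly(wt, best_degree), data = cars, weights = w)
coef(best_fit)
```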