---
title: "Basic usage"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Basic usage}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
```{r setup, include = FALSE}
knitr::opts_chunk$set(comment = "#",
collapse = TRUE,
eval = TRUE,
echo = TRUE,
warning = FALSE,
message = FALSE)
```
The goal of this vignette is to walk through `modeltuning` usage in
detail. We'll train a classification model on the `iris` dataset to predict
whether or not a flower's species is virginica.
## Load Packages
```{r Load Packages}
library(e1071)
library(modeltuning) # devtools::install_github("dmolitor/modeltuning")
library(yardstick)
```
## Data Prep
First, let's generate a larger synthetic dataset by replicating the original `iris`
data, adding random noise to the features, and combining everything into one data frame.
```{r Iris Big}
iris_new <- do.call(
what = rbind,
args = replicate(n = 10, iris, simplify = FALSE)
) |>
transform(
Sepal.Length = jitter(Sepal.Length, 0.1),
Sepal.Width = jitter(Sepal.Width, 0.1),
Petal.Length = jitter(Petal.Length, 0.1),
Petal.Width = jitter(Petal.Width, 0.1),
Species = factor(Species == "virginica")
)
# Shuffle the dataset
iris_new <- iris_new[sample(nrow(iris_new)), ]
# Quick overview of the dataset
summary(iris_new[, 1:4])
```
## Function arguments
### Common arguments
The following modeling approach holds for the `CV`, `GridSearch`, and `GridSearchCV`
classes, which are slight variations of one another. The common arguments are as follows:
- `learner`: This is where you pass your predictive modeling function; in our case
a support vector machine, so `svm()` from the e1071 package.
- `scorer`: This is a named list of metric functions that evaluate the model's
predictive performance. Each metric function should have two arguments, `truth` and
`estimate`, which take the true outcome values and the predicted outcome values, and
it should return a single numeric score. The [yardstick](https://yardstick.tidymodels.org/)
package provides a wide array of metric functions that cover most
common cases. E.g. for the RMSE of a regression, `scorer = list(rmse = yardstick::rmse_vec)`.
- `learner_args`: This is a named list of function arguments that get passed
directly to the `learner` function. For example, `e1071::svm` takes a `type`
argument specifying whether it performs regression or classification. You could
specify a classification task as `learner_args = list(type = "C-classification")`.
- `scorer_args`: This is a named list of function arguments to pass to the
scorer functions in `scorer`. This list should have one element per element in
`scorer`. E.g. if `scorer = list(rmse = rmse_vec, mae = mae_vec)` then `scorer_args = list(rmse = list(...), mae = list(...))`.
- `prediction_args`: This is similar to `scorer_args`: a named list, with one
element per element in `scorer`, of function arguments passed to the `predict`
method. E.g. our SVM learner's `predict` method has a `probability` argument
that controls whether class probabilities are computed in addition to the
predicted classes. To request class probabilities for a scorer named `roc_auc`,
use `prediction_args = list(roc_auc = list(probability = TRUE))`.
- `convert_predictions`: A named list of functions, one per element of `scorer`,
that transform the output of `predict(...)` into a vector of predictions. The
output of `predict()` is not always a plain vector. E.g.
`predict(svm_model, probability = TRUE)` returns the predicted classes with the
class probabilities attached as a `"probabilities"` attribute, while
`predict(svm_model)` returns just the predicted classes. For calculating
model accuracy you need class predictions, while ROC AUC requires class
probabilities. Suppose that `scorer = list(accuracy = accuracy_vec, auc = roc_auc_vec)`.
To ensure that accuracy gets class predictions and ROC AUC gets class
probabilities, provide the corresponding prediction arguments
`prediction_args = list(accuracy = NULL, auc = list(probability = TRUE))` and then
convert those predictions into vectors with
`convert_predictions = list(accuracy = NULL, auc = function(.x) attr(.x, "probabilities")[, "FALSE"])`
(a short standalone sketch follows this list).
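To make the `scorer`, `prediction_args`, and `convert_predictions` pieces concrete, here is
a minimal standalone sketch that bypasses `modeltuning` entirely: a scorer with the
`truth`/`estimate` signature and a conversion step applied to raw SVM predictions. The object
names (`my_accuracy`, `fit`, `preds`, `prob_false`) are purely illustrative.
```{r Scorer sketch, eval = FALSE}
# A scorer is any function of (truth, estimate) that returns a single number;
# yardstick's *_vec functions already follow this convention.
my_accuracy <- function(truth, estimate) {
  yardstick::accuracy_vec(truth = truth, estimate = estimate)
}
# Fit an SVM with probability estimates enabled, then predict
fit <- e1071::svm(
  Species ~ ., data = iris_new,
  type = "C-classification", probability = TRUE
)
preds <- predict(fit, newdata = iris_new, probability = TRUE)
# `preds` is a factor of predicted classes with a "probabilities" matrix
# attached as an attribute; pull one column out as a plain numeric vector
prob_false <- attr(preds, "probabilities")[, "FALSE"]
my_accuracy(truth = iris_new$Species, estimate = preds)
yardstick::roc_auc_vec(truth = iris_new$Species, estimate = prob_false)
```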
### Cross-validation arguments
The following arguments are specific to the `CV` cross-validation class.
- `splitter`: This should be a function that takes the training dataset and
returns a list of cross-validation fold indices. `modeltuning` provides a simple
implementation of this, `cv_split()`.
- `splitter_args`: A named list of function arguments to pass to `splitter`. E.g.
`cv_split()` has an argument `v` that specifies the number of cross-validation folds.
For 3 folds, set `splitter_args = list(v = 3)` (see the sketch just after this list).
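As a quick illustration, the call below sketches what invoking the splitter directly might
look like, assuming `cv_split()` takes the data as its first argument and `v` as the number
of folds, as described above.
```{r Splitter sketch, eval = FALSE}
# Build 3 folds of row indices from the shuffled data and inspect them
folds <- cv_split(iris_new, v = 3)
str(folds, max.level = 1)
```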
### Grid search arguments
The following arguments are specific to the `GridSearch` class.
- `tune_params`: A named list giving the candidate values to search over for each
hyper-parameter, e.g. `tune_params = list(cost = c(0.1, 1), kernel = c("radial", "sigmoid"))`.
- `evaluation_data`: This must be a list containing validation data, e.g.
`list(x = x_eval, y = y_eval)`.
- `optimize_score`: One of `"max"` or `"min"`; whether to maximize or minimize
the metric defined in `scorer` when selecting the optimal grid search parameters. **Note**:
if you specify multiple metric functions in `scorer`, `modeltuning` will use the _last_
metric function to find the optimal parameters, as sketched below.
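For example, to have ROC AUC drive the choice of best parameters while still reporting
accuracy, put it last in the `scorer` list (a sketch; the remaining constructor arguments
are omitted):
```{r Optimize score sketch, eval = FALSE}
# The last scorer (roc_auc) is the one used to pick the best parameters;
# optimize_score = "max" says that larger values of that metric are better.
scorers <- list(accuracy = accuracy_vec, roc_auc = roc_auc_vec)
# Pass as: GridSearch$new(..., scorer = scorers, optimize_score = "max")
```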
## Examples
We'll show simple examples of each of `CV`, `GridSearch` and `GridSearchCV`.
### CV
```{r CV}
iris_cv <- CV$new(
learner = svm,
learner_args = list(type = "C-classification", probability = TRUE),
splitter = cv_split,
splitter_args = list(v = 3),
scorer = list(roc_auc = roc_auc_vec),
prediction_args = list(roc_auc = list(probability = TRUE)),
convert_predictions = list(roc_auc = function(.x) attr(.x, "probabilities")[, "FALSE"])
)
# Fit the cross-validated model
iris_cv_fitted <- iris_cv$fit(formula = Species ~ ., data = iris_new)
iris_cv_fitted$mean_metrics
```
### GridSearch
```{r GridSearch}
# Hold out the remaining rows as an evaluation set for the grid search
iris_new_train <- iris_new[1:1000, ]
iris_new_eval <- iris_new[1001:nrow(iris_new), ]
iris_grid <- GridSearch$new(
learner = svm,
tune_params = list(
cost = c(0.01, 0.1, 0.5, 1, 3, 6),
kernel = c("polynomial", "radial", "sigmoid")
),
learner_args = list(type = "C-classification", probability = TRUE),
evaluation_data = list(x = iris_new_eval[, -5], y = iris_new_eval[, 5]),
scorer = list(roc_auc = roc_auc_vec),
prediction_args = list(roc_auc = list(probability = TRUE)),
convert_predictions = list(roc_auc = function(.x) attr(.x, "probabilities")[, "FALSE"]),
optimize_score = "max"
)
# Fit the grid search on the training split
iris_grid_fitted <- iris_grid$fit(formula = Species ~ ., data = iris_new_train)
iris_grid_fitted$best_params
```
### GridSearchCV
```{r Grid Search with CV}
iris_grid <- GridSearchCV$new(
learner = svm,
tune_params = list(
cost = c(0.01, 0.1, 0.5, 1, 3, 6),
kernel = c("polynomial", "radial", "sigmoid")
),
learner_args = list(type = "C-classification", probability = TRUE),
splitter = cv_split,
splitter_args = list(v = 3),
scorer = list(roc_auc = roc_auc_vec),
prediction_args = list(roc_auc = list(probability = TRUE)),
convert_predictions = list(roc_auc = function(.x) attr(.x, "probabilities")[, "FALSE"]),
optimize_score = "max"
)
# Fit the cross-validated grid search
iris_grid_fitted <- iris_grid$fit(formula = Species ~ ., data = iris_new)
iris_grid_fitted$best_params
```