The goal of this vignette is to walk through modeltuning usage in detail. We'll train a classification model on the iris dataset to predict whether a flower's species is virginica or not.
library(e1071)
library(modeltuning) # devtools::install_github("dmolitor/modeltuning")
library(yardstick)
First, let's generate a larger synthetic dataset by adding random noise to the original iris features and stacking the copies into one data frame.
iris_new <- do.call(
what = rbind,
args = replicate(n = 10, iris, simplify = FALSE)
) |>
transform(
Sepal.Length = jitter(Sepal.Length, 0.1),
Sepal.Width = jitter(Sepal.Width, 0.1),
Petal.Length = jitter(Petal.Length, 0.1),
Petal.Width = jitter(Petal.Width, 0.1),
Species = factor(Species == "virginica")
)
# Shuffle the dataset
iris_new <- iris_new[sample(1:nrow(iris_new), nrow(iris_new)), ]
# Quick overview of the dataset
summary(iris_new[, 1:4])
# Sepal.Length Sepal.Width Petal.Length Petal.Width
# Min. :4.298 Min. :1.998 Min. :0.9983 Min. :0.09802
# 1st Qu.:5.100 1st Qu.:2.799 1st Qu.:1.5984 1st Qu.:0.29982
# Median :5.799 Median :3.001 Median :4.3500 Median :1.30119
# Mean :5.843 Mean :3.057 Mean :3.7580 Mean :1.19931
# 3rd Qu.:6.400 3rd Qu.:3.302 3rd Qu.:5.1000 3rd Qu.:1.80069
# Max. :7.902 Max. :4.401 Max. :6.9014 Max. :2.50172
The following modeling approach holds for the CV, GridSearch, and GridSearchCV classes, which are slight variations of one another. Their common arguments are as follows:
learner: This is where you pass your predictive
modeling function; in our case a Support Vector Machine, so
e1071::svm from the e1071 package.
scorer: This is a named list of metric functions
that will evaluate the model’s predictive performance. Each metric
function should have two arguments, truth and estimate, that take the true outcome values and the predicted outcome values respectively, and it should return a scalar numeric score. The
yardstick package
provides a wide array of these metric functions that should cover most
common cases. E.g. for the RMSE of a regression,
scorer = list(rmse = yardstick::rmse_vec).
learner_args: This is a named list of function
arguments that get passed directly to the learner function.
For example, the e1071::svm function takes a
type argument specifying whether it is a regression or
classification task. You could specify a classification task as
learner_args = list(type = "classification").
scorer_args: This is a named list of function
arguments to pass to the scorer functions in scorer. This
list should have one element per element in scorer. E.g. if
scorer = list(rmse = rmse_vec, mae = mae_vec) then
scorer_args = list(rmse = list(...), mae = list(...)).
prediction_args: This is similar to learner_args: a named list of function arguments passed to the predict method. E.g. our SVM learner's predict method has a probability argument specifying whether to predict outcome classes or class probabilities. With scorer = list(auc = roc_auc_vec), you could request class probabilities via prediction_args = list(auc = list(probability = TRUE)). Similar to scorer_args, this list should have one element per element in scorer.
convert_predictions: A named list of functions to transform the output of predict(...) into a vector of predictions. The model's predicted values are not always a plain vector. E.g. predict(svm_model, probability = TRUE) returns the predicted classes with a matrix of class probabilities (one column per class) attached as an attribute, while predict(svm_model, probability = FALSE) returns just a vector of classes. Calculating model accuracy requires class predictions, while ROC AUC requires class probabilities. Suppose that scorer = list(accuracy = accuracy_vec, auc = roc_auc_vec). To ensure that accuracy gets class predictions and ROC AUC gets class probabilities, you can provide the corresponding prediction arguments prediction_args = list(accuracy = NULL, auc = list(probability = TRUE)) and then convert those predictions into a vector with convert_predictions = list(accuracy = NULL, auc = function(.x) attr(.x, "probabilities")[, "FALSE"]). A short sketch of how these arguments line up follows this list.
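To make the accuracy/ROC AUC case above concrete, here is a hedged sketch. It assumes the iris_new data defined earlier and that e1071::svm attaches its class-probability matrix as the "probabilities" attribute of the prediction object (which is what the convert function above relies on).
# Fit an SVM directly and inspect what predict() returns
fit <- svm(Species ~ ., data = iris_new, type = "C-classification", probability = TRUE)
preds <- predict(fit, newdata = iris_new, probability = TRUE)
head(attr(preds, "probabilities"))  # matrix of class probabilities, columns "FALSE"/"TRUE"
# Element-by-element alignment of scorer, prediction_args, and convert_predictions
scorer <- list(
  accuracy = accuracy_vec,  # needs class predictions
  auc      = roc_auc_vec    # needs class probabilities
)
prediction_args <- list(
  accuracy = NULL,                      # default predict() output is fine
  auc      = list(probability = TRUE)   # ask predict() for probabilities
)
convert_predictions <- list(
  accuracy = NULL,  # already a factor of class predictions
  auc      = function(.x) attr(.x, "probabilities")[, "FALSE"]  # probability of the first factor level
)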
The following arguments are specific to the CV cross-validation class.
splitter: This should be a function that takes the training dataset and returns a list of cross-validation indices. modeltuning provides a simple helper, cv_split(), that handles the most basic version of this; a sketch of a custom splitter follows this list.
splitter_args: A named list of function arguments to pass to splitter. E.g. cv_split() has an argument v that specifies the number of cross-validation folds. For 3 folds, set splitter_args = list(v = 3).
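As an illustration of the splitter interface described above, here is a hypothetical custom splitter. It assumes, as stated above, that the function receives the training data and returns a list of cross-validation index vectors, one element per fold; random_folds is purely illustrative and is not part of modeltuning.
# Hypothetical splitter: shuffle the row indices and cut them into v folds
random_folds <- function(data, v = 5) {
  idx <- sample(nrow(data))
  split(idx, cut(seq_along(idx), breaks = v, labels = FALSE))
}
# It would then plug in wherever cv_split is used, e.g.
# CV$new(..., splitter = random_folds, splitter_args = list(v = 5))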
The following arguments are specific to the GridSearch grid-search class.
evaluation_data: This must be a list containing
validation data, e.g.
list(x = x_eval, y = y_eval).
optimize_score: One of "max" or
"min". Whether to maximize or minimize the metric defined
in scorer to find the optimal grid search parameters.
Note: if you specify multiple metric functions in scorer, modeltuning will use the last metric function to find the optimal parameters, as illustrated in the sketch after this list.
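For instance, here is a hedged sketch of a scorer list that computes both accuracy and ROC AUC; because roc_auc is listed last, it is the metric that (with optimize_score = "max") determines the best grid-search parameters.
# Put the metric you want to optimize last: roc_auc drives parameter selection,
# while accuracy is still computed and reported
scorer <- list(
  accuracy = accuracy_vec,
  roc_auc  = roc_auc_vec
)
# prediction_args and convert_predictions would need matching elements, e.g.
# prediction_args     = list(accuracy = NULL, roc_auc = list(probability = TRUE))
# convert_predictions = list(accuracy = NULL, roc_auc = function(.x) attr(.x, "probabilities")[, "FALSE"])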
We’ll show simple examples of each of CV,
GridSearch and GridSearchCV.
iris_cv <- CV$new(
learner = svm,
learner_args = list(type = "C-classification", probability = TRUE),
splitter = cv_split,
splitter_args = list(v = 3),
scorer = list(roc_auc = roc_auc_vec),
prediction_args = list(roc_auc = list(probability = TRUE)),
convert_predictions = list(roc_auc = function(.x) attr(.x, "probabilities")[, "FALSE"])
)
# Fit cross validated model
iris_cv_fitted <- iris_cv$fit(formula = Species ~ ., data = iris_new)
iris_cv_fitted$mean_metrics
# $roc_auc
# [1] 0.9984446
iris_new_train <- iris_new[1:1000, ]
iris_new_eval <- iris_new[1001:nrow(iris_new), ]
iris_grid <- GridSearch$new(
learner = svm,
tune_params = list(
cost = c(0.01, 0.1, 0.5, 1, 3, 6),
kernel = c("polynomial", "radial", "sigmoid")
),
learner_args = list(type = "C-classification", probability = TRUE),
evaluation_data = list(x = iris_new_eval[, -5], y = iris_new_eval[, 5]),
scorer = list(roc_auc = roc_auc_vec),
prediction_args = list(roc_auc = list(probability = TRUE)),
convert_predictions = list(roc_auc = function(.x) attr(.x, "probabilities")[, "FALSE"]),
optimize_score = "max"
)
# Fit the grid search model on the training split
iris_grid_fitted <- iris_grid$fit(formula = Species ~ ., data = iris_new_train)
iris_grid_fitted$best_params
# $cost
# [1] 6
#
# $kernel
# [1] "polynomial"iris_grid <- GridSearchCV$new(
learner = svm,
tune_params = list(
cost = c(0.01, 0.1, 0.5, 1, 3, 6),
kernel = c("polynomial", "radial", "sigmoid")
),
learner_args = list(type = "C-classification", probability = TRUE),
splitter = cv_split,
splitter_args = list(v = 3),
scorer = list(roc_auc = roc_auc_vec),
prediction_args = list(roc_auc = list(probability = TRUE)),
convert_predictions = list(roc_auc = function(.x) attr(.x, "probabilities")[, "FALSE"]),
optimize_score = "max"
)
# Fit the cross-validated grid search model
iris_grid_fitted <- iris_grid$fit(formula = Species ~ ., data = iris_new)
iris_grid_fitted$best_params
# $cost
# [1] 6
#
# $kernel
# [1] "polynomial"