--- title: "Basic usage" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Basic usage} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set(comment = "#", collapse = TRUE, eval = TRUE, echo = TRUE, warning = FALSE, message = FALSE) ``` The goal of this vignette is to walk through `modeltuning` usage in detail. We'll be training a classification model on the `iris` data-set to predict whether a flower's species is Virginica or not. # Load Packages ```{r Load Packages} library(e1071) library(modeltuning) # devtools::install_github("dmolitor/modeltuning") library(yardstick) ``` ## Data Prep First, let's generate a bunch of synthetic data observations by adding random noise to the original `iris` features and combining it into one big dataframe. ```{r Iris Big} iris_new <- do.call( what = rbind, args = replicate(n = 10, iris, simplify = FALSE) ) |> transform( Sepal.Length = jitter(Sepal.Length, 0.1), Sepal.Width = jitter(Sepal.Width, 0.1), Petal.Length = jitter(Petal.Length, 0.1), Petal.Width = jitter(Petal.Width, 0.1), Species = factor(Species == "virginica") ) # Shuffle the data-set iris_new <- iris_new[sample(1:nrow(iris_new), nrow(iris_new)), ] # Quick overview of the dataset summary(iris_new[, 1:4]) ``` ## Function arguments ### Common arguments The following modeling approach holds for the `CV`, `GridSearch` and `GridSearchCV` classes, which are all very slight variations of each other. Common arguments are as follows: - `learner`: This is where you pass your predictive modeling function; in our case a Support Vector Machine, so `e1071::svm` from the e1071 package. - `scorer`: This is a named list of metric functions that will evaluate the model's predictive performance. Each metric function should have two arguments, `truth` and `estimate` that intake the true outcome values and the predicted outcome values. It should output a scalar numeric score. The [yardstick](https://yardstick.tidymodels.org/) package provides a wide array of these metric functions that should cover most common cases. E.g. for the RMSE of a regression, `scorer = list(rmse = yarstick::rmse_vec)`. - `learner_args`: This is a named list of function arguments that get passed directly to the `learner` function. For example, the `e1071::svm` function takes a `type` argument specifying whether it is a regression or classification task. You could specify a classification task as `learner_args = list(type = "classification")`. - `scorer_args`: This is a named list of function arguments to pass to the scorer functions in `scorer`. This list should have one element per element in `scorer`. E.g. if `scorer = list(rmse = rmse_vec, mae = mae_vec)` then `scorer_args = list(rmse = list(...), mae = list(...))`. - `prediction_args`: This is similar to `learner_args`. It's a named list of function arguments passed to the `predict` method. E.g. our SVM learner has a predict argument called `probability` whether to predict outcome classes or class probabilities. Specify class probabilities as `prediction_args = list(probability = TRUE)`. Similar to `scorer_args`, this list should have one element per element in `scorer`. - `convert_predictions`: A named list of functions to transform the output of `predict(...)` into a vector of predictions. By default, the model's predicted values may not always be a vector. E.g. 
### Cross validation arguments

The following arguments are specific to the `CV` cross validation class.

- `splitter`: This should be a function that takes the training dataset and returns a list of cross validation fold indices. `modeltuning` provides a simple helper, `cv_split()`, that handles the most common case. A sketch of what a custom splitter might look like follows this list.
- `splitter_args`: A named list of function arguments to pass to `splitter`. E.g. `cv_split()` has an argument `v` which specifies the number of cross validation folds. For 3 folds, set `splitter_args = list(v = 3)`.
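As a rough illustration, a custom splitter could look something like the sketch below. It assumes the contract described above (a function of the training data that returns a list with one vector of row indices per fold); the exact structure `cv_split()` returns may differ, so treat this as a sketch rather than a drop-in replacement.

```{r Custom splitter, eval = FALSE}
# A sketch of a custom splitter: assign each row to one of `v` folds at
# random and return a list with one vector of row indices per fold.
manual_split <- function(data, v = 3) {
  fold_id <- sample(rep(seq_len(v), length.out = nrow(data)))
  lapply(seq_len(v), function(i) which(fold_id == i))
}

# Hypothetical usage: splitter = manual_split, splitter_args = list(v = 5)
```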
### Grid search arguments

The following arguments are specific to the `GridSearch` grid search class.

- `tune_params`: A named list of candidate values for each hyperparameter to tune; the grid search evaluates every combination (see the examples below).
- `evaluation_data`: This must be a list containing validation data, e.g. `list(x = x_eval, y = y_eval)`.
- `optimize_score`: One of `"max"` or `"min"`. Whether to maximize or minimize the metric defined in `scorer` to find the optimal grid search parameters. **Note**: if you specify multiple metric functions in `scorer`, `modeltuning` will use the _last_ metric function to find the optimal parameters.

## Examples

We'll show simple examples of each of `CV`, `GridSearch`, and `GridSearchCV`.

### CV

```{r CV}
iris_cv <- CV$new(
  learner = svm,
  learner_args = list(type = "C-classification", probability = TRUE),
  splitter = cv_split,
  splitter_args = list(v = 3),
  scorer = list(roc_auc = roc_auc_vec),
  prediction_args = list(roc_auc = list(probability = TRUE)),
  convert_predictions = list(
    roc_auc = function(.x) attr(.x, "probabilities")[, "FALSE"]
  )
)

# Fit cross validated model
iris_cv_fitted <- iris_cv$fit(formula = Species ~ ., data = iris_new)
iris_cv_fitted$mean_metrics
```

### GridSearch

```{r GridSearch}
# Hold out the last third of the shuffled data for evaluation
iris_new_train <- iris_new[1:1000, ]
iris_new_eval <- iris_new[1001:nrow(iris_new), ]

iris_grid <- GridSearch$new(
  learner = svm,
  tune_params = list(
    cost = c(0.01, 0.1, 0.5, 1, 3, 6),
    kernel = c("polynomial", "radial", "sigmoid")
  ),
  learner_args = list(type = "C-classification", probability = TRUE),
  evaluation_data = list(x = iris_new_eval[, -5], y = iris_new_eval[, 5]),
  scorer = list(roc_auc = roc_auc_vec),
  prediction_args = list(roc_auc = list(probability = TRUE)),
  convert_predictions = list(
    roc_auc = function(.x) attr(.x, "probabilities")[, "FALSE"]
  ),
  optimize_score = "max"
)

# Fit grid search model on the training data
iris_grid_fitted <- iris_grid$fit(formula = Species ~ ., data = iris_new_train)
iris_grid_fitted$best_params
```

### GridSearchCV

```{r Grid Search with CV}
iris_grid <- GridSearchCV$new(
  learner = svm,
  tune_params = list(
    cost = c(0.01, 0.1, 0.5, 1, 3, 6),
    kernel = c("polynomial", "radial", "sigmoid")
  ),
  learner_args = list(type = "C-classification", probability = TRUE),
  splitter = cv_split,
  splitter_args = list(v = 3),
  scorer = list(roc_auc = roc_auc_vec),
  prediction_args = list(roc_auc = list(probability = TRUE)),
  convert_predictions = list(
    roc_auc = function(.x) attr(.x, "probabilities")[, "FALSE"]
  ),
  optimize_score = "max"
)

# Fit cross validated grid search model
iris_grid_fitted <- iris_grid$fit(formula = Species ~ ., data = iris_new)
iris_grid_fitted$best_params
```
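Once the grid search has run, `best_params` holds the winning hyperparameter combination. As a rough sketch of a next step, assuming `best_params` behaves like a named list with `cost` and `kernel` elements (which may not match the package's exact return structure), you could refit a final SVM with those values:

```{r Refit best model, eval = FALSE}
# Assumed structure: best_params exposes the selected `cost` and `kernel`.
best <- iris_grid_fitted$best_params

final_model <- svm(
  Species ~ .,
  data = iris_new,
  type = "C-classification",
  probability = TRUE,
  cost = best$cost,
  kernel = best$kernel
)
```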