--- title: "mlexperiments: Getting Started" vignette: > %\VignetteEncoding{UTF-8} %\VignetteIndexEntry{mlexperiments: Getting Started} %\VignetteEngine{quarto::html} editor_options: chunk_output_type: console execute: eval: false collapse: true comment: "#>" --- ```{r, include = FALSE} # nolint start ``` The goal of the package `mlexperiments` is to provide an extensible framework for reproducible machine learning experiments, namely: * Hyperparameter tuning: with the R6 class `mlexperiments::MLTuneParameters`, to optimize the hyperparameters in a k-fold cross-validation with one of the two strategies + Grid search + Bayesian optimization (using the [`ParBayesianOptimization`](https://github.com/AnotherSamWilson/ParBayesianOptimization) R package) * K-fold Cross-validation (CV): with the R6 class `mlexperiments::MLCrossValidation`, to validate one hyperparameter setting * Nested k-fold cross validation: with the R6 class `mlexperiments::MLNestedCV`, which basically combines the two experiments above to perform a hyperparameter optimization on an inner CV loop, and to validate the best hyperparameter setting on an outer CV loop The package provides a minimal shell for these experiments, and - with few adjustments - users can prepare different learner algorithms so that they can be used with `mlexperiments`. This vignette will go through the steps that are necessary to prepare a new learner. ## General Overview In general, the learner class exposes 4 methods that can be defined: * `$fit` A wrapper around the private function `fun_fit`, which needs to be defined for every learner. The return value of this function is the fitted model. * `$predict` A wrapper around the private function `fun_predict`, which needs to be defined for every learner. The function must accept the three arguments `model`, `newdata`, and `ncores` and is a wrapper around the respective learner's predict-function. In order to allow the passing of further arguments, the ellipsis (`...`) can be used. The function should return the prediction results. * `$cross_validation` A wrapper around the private function `fun_optim_cv`, which needs to be defined when hyperparameters should be optimized with a grid search (required for use with `mlexperiments::MLTuneParameters`, and `mlexperiments::MLNestedCV`). * `$bayesian_scoring_function` A wrapper around the private function `fun_bayesian_scoring_function`, which needs to be defined when hyperparameters should be optimized with a Bayesian process (required for use with `mlexperiments::MLTuneParameters`, and `mlexperiments::MLNestedCV`). In the following, we will go through the steps to prepare the algorithm '`class::knn()`' to be used with `mlexperiments` (the same code is also implemented in the package and ready to use as `mlexperiments::LearnerKnn`). ## Steps to Prepare an Algorithm for Use with `mlexperiments` ### The `fit` Method This method must take the arguments `x`, `y`, `ncores`, `seed`, as well as the ellipsis (`...`), while arguments to parameterize the learner are to be passed to the function with the latter. The `fit` method should include one call to fit a model of the algorithm and it should finally return the fitted model. ```{r eval=FALSE} knn_fit <- function(x, y, ncores, seed, ...) { kwargs <- list(...) stopifnot("k" %in% names(kwargs)) args <- kdry::list.append(list(train = x, cl = y), kwargs) args$prob <- TRUE set.seed(seed) fit <- do.call(class::knn, args) return(fit) } ``` ### The `predict` Method This method must take the arguments `model`, `newdata`, `ncores`, and the ellipsis (`...`). It is a wrapper around the respective algorithm's `predict()` function, while specific arguments required to parameterize it can be passed with the ellipsis. The experiments `mlexperiments::MLCrossValidation` and `mlexperiments::MLNestedCV` do both have the field `$predict_args` to define a list that is further passed on to the `predcit` method's ellipsis. In contrast, when it is required to further parameterize this method during the hyperparameter tuning (`mlexperiments::MLTuneParameters`), it is required to define those parameters within the `cross_validation` method (see below). The returned value of the `predict` method should be a vector with the predictions. ```{r eval=FALSE} knn_predict <- function(model, newdata, ncores, ...) { kwargs <- list(...) stopifnot("type" %in% names(kwargs)) if (kwargs$type == "response") { return(model) } else if (kwargs$type == "prob") { # there is no knn-model but the probabilities predicted for the test data return(attributes(model)$prob) } } ``` The implementation of `class::knn()` is in some ways special and different from the implementation of other algorithms. One of these peculiarities is that `class::knn()` does not return a fitted model but instead returns the predicted values directly. Depending on the value of the argument `prob`, these results also include the probability values of the predicted classes. ### The `cross_validation` Method The purpose of this function is to perform a k-fold cross validation for one specific hyperparameter setting. The function must take the arguments `x`, `y`, `params` (a list of hyperparameters), `fold_list` (to define the cross-validation folds), `ncores`, and `seed`. Finally, the function must return a named list with at least one item called `metric_optim_mean`, which contains the cross validated error metric. ```{r eval=FALSE} knn_optimization <- function(x, y, params, fold_list, ncores, seed) { stopifnot(is.list(params), "k" %in% names(params)) # initialize a dataframe to store the results results_df <- data.table::data.table( "fold" = character(0), "metric" = numeric(0) ) # we do not need test here as it is defined explicitly below params[["test"]] <- NULL # loop over the folds for (fold in names(fold_list)) { # get row-ids of the current fold train_idx <- fold_list[[fold]] # create learner-arguments args <- kdry::list.append( list( x = kdry::mlh_subset(x, train_idx), test = kdry::mlh_subset(x, -train_idx), y = kdry::mlh_subset(y, train_idx), use.all = FALSE, ncores = ncores, seed = seed ), params ) set.seed(seed) cvfit <- do.call(knn_fit, args) # optimize error rate FUN <- metric("ce") # nolint err <- FUN(predictions = knn_predict( model = cvfit, newdata = kdry::mlh_subset(x, -train_idx), ncores = ncores, type = "response" ), ground_truth = kdry::mlh_subset(y, -train_idx) ) results_df <- data.table::rbindlist( l = list(results_df, list("fold" = fold, "validation_metric" = err)), fill = TRUE ) } res <- list("metric_optim_mean" = mean(results_df$validation_metric)) return(res) } ``` ### The `bayesian_scoring_function` Method This function can be thought of as a "gatekeeper" that takes a new suggested hyperparameter configuration from the Bayesian process and forwards this configuration further on to a call of the `cross_validation` method (see above) in order to evaluate this specific setting. However, some peculiarities must be considered in this regard: 1. The functions needs to take the hyperparameters that should be optimized as function arguments (I generally use the ellipsis (`...`), however, the hyperparameters can also be defined as arguments explicitly). 2. When using the `strategy = "bayesian"`, the package is configured in a way that the Bayesian process is parallelized, hence parallel threads evaluate different hyperparameter settings simultaneously (see [`ParBayesianOptimization's Readme`](https://github.com/AnotherSamWilson/ParBayesianOptimization#running-in-parallel) for more details). Therefore, the call to the `cross_validation` method must explicitly specify `ncores = 1L` in order to no get in trouble with requesting more resources than available, when using the `strategy = "bayesian"`. 3. The value returned from the Bayesian scoring function must be a named list that contains the optimization metric as the item `Score`. As described above, the returned value from `cross_validation` is already a named list that contains the optimization metric with the item `metric_optim_mean`. As this item is required later on internally for the `mlexperiments` package , the value value of this item is just copied and saved under the new name "Score" to address the requirements of `ParBayesianOptimization`. Note: please notice that `mlexperiments` already takes care of the direction of the optimization metric, which is handled depending on the learner's initialization argument `metric_optimization_higher_better`, so no changes should be made here to ensure a correct functioning. ```{r eval=FALSE} knn_bsF <- function(...) { # nolint params <- list(...) # call to knn_optimization here with ncores = 1, since the Bayesian search # is parallelized already / "FUN is fitted n times in m threads" set.seed(seed)#, kind = "L'Ecuyer-CMRG") bayes_opt_knn <- knn_optimization( x = x, y = y, params = params, fold_list = method_helper$fold_list, ncores = 1L, # important, as bayesian search is already parallelized seed = seed ) ret <- kdry::list.append( list("Score" = bayes_opt_knn$metric_optim_mean), bayes_opt_knn ) return(ret) } ``` More details on the package [`ParBayesianOptimization`](https://cran.r-project.org/package=ParBayesianOptimization) and on how to define the Bayesian scoring function can be found in its [package vignette](https://cran.r-project.org/package=ParBayesianOptimization/vignettes/tuningHyperparameters.html#practical-example). For the parallelization of the Bayesian process, all required functions must be exported to the cluster. To facilitate this, a simple wrapper function can be created that returns a character vector of all custom functions that are called from within the Bayesian scoring function. The following function shows the objects that need to be exported for a correct functioning of the `LearnerKnn`: ```{r eval=FALSE} # define the objects / functions that need to be exported to each cluster # for parallelizing the Bayesian optimization. knn_ce <- function() { c("knn_optimization", "knn_fit", "knn_predict", "metric", ".format_xy") } ``` ### Finally, Create an R6 Class for the Learner Finally, all of these created functions need to be integrated into a learner object. This is basically done by overwriting the placeholders in an R6 learner that inherits from `mlexperiments::MLLearnerBase`. The placeholders are: | Name | Type | Description | | ---- | ---- | ----------- | | `private$fun_fit` | function | A function to fit a model of the respective algorithm. The function must return the fitted model. | | `private$fun_predict` | function | A function to predict the outcome in new data. The returned value of the `predict` method should be a vector with the predictions. | | `private$fun_optim_cv` | function | A function to perform a k-fold cross-validation for one hyperparameter setting. The function must return a named list with at least one item called `metric_optim_mean`, which contains the cross validated error metric. | | `private$fun_bayesian_scoring_function` | function | A function that is defined according to the requirements of the [`ParBayesianOptimization`](https://cran.r-project.org/package=ParBayesianOptimization) R package. It must return a named list that contains the optimization metric as the item `Score`. | | `self$environment` | field | The environment, where to search for the objects that need to be exported to a parallel cluster (required for Bayesian optimization). When the R6 learner is part of an R package, you can write the name of the R package here. Otherwise, `-1L` (the global environment) might be suitable as long as all objects that are defined in the field `cluster_export` are available from the global environment. | | `self$cluster_export` | field | A character vector with the names of objects that need to be exported to each node of a parallel cluster when performing a Bayesian optimization. | These assignments should be done in the `initialize()` function. The following code example shows the assignment of the previously created functions to the respective functions and fields of the newly created R6 class `LearnerKnn`: ```{r eval=FALSE} LearnerKnn <- R6::R6Class( # nolint classname = "LearnerKnn", inherit = mlexperiments::MLLearnerBase, public = list( initialize = function() { if (!requireNamespace("class", quietly = TRUE)) { stop( paste0( "Package \"class\" must be installed to use ", "'learner = \"LearnerKnn\"'." ), call. = FALSE ) } super$initialize( metric_optimization_higher_better = FALSE # classification error ) private$fun_fit <- knn_fit private$fun_predict <- knn_predict private$fun_optim_cv <- knn_optimization private$fun_bayesian_scoring_function <- knn_bsF self$environment <- "mlexperiments" self$cluster_export <- knn_ce() } ) ) ``` Please note that `metric_optimization_higher_better` is defaulted to `FALSE` here when initializing the super-class. This is because of choosing the error rate as the optimization metric (`FUN <- metric("ce")`) when defining the `cross_validation`-function above. ## Examples Now, the learner is put together and ready to be used with `mlexperiments`: ### Preparations First of all, load the data and transform it into a matrix, and define the training data and the target variable. ```{r} library(mlexperiments) library(mlbench) data("DNA") dataset <- DNA |> data.table::as.data.table() |> na.omit() seed <- 123 feature_cols <- colnames(dataset)[1:180] train_x <- model.matrix( ~ -1 + ., dataset[, .SD, .SDcols = feature_cols] ) train_y <- dataset[, get("Class")] ncores <- ifelse( test = parallel::detectCores() > 4, yes = 4L, no = ifelse( test = parallel::detectCores() < 2L, yes = 1L, no = parallel::detectCores() ) ) if (isTRUE(as.logical(Sys.getenv("_R_CHECK_LIMIT_CORES_")))) { # on cran ncores <- 2L } ``` ### Hyperparameter Tuning #### Bayesian Tuning For the Bayesian hyperparameter optimization, it is required to define a grid with some hyperparameter combinations that is used for initializing the Bayesian process. Furthermore, the borders (allowed extreme values) of the hyperparameters that are actually optimized need to be defined in a list. Finally, further arguments that are passed to the function `ParBayesianOptimization::bayesOpt()` can be defined as well. ```{r} param_list_knn <- expand.grid( k = seq(4, 68, 8), l = 0, test = parse(text = "fold_test$x") ) knn_bounds <- list(k = c(2L, 80L)) optim_args <- list( iters.n = ncores, kappa = 3.5, acq = "ucb" ) ``` Here, another peculiarity of `class::knn()` is visible: when fitting a model, one needs to specify the argument `test` in order to specify a matrix of test set cases. In order to have the correct test set cases selected throughout the cross-validation, one needs to specify argument as an expression, which is then evaluated before passing the arguments on to the `fit`-function. Generally speaking, this is a feature implemented in `mlexperiments`: when specifying an expression as a learner argument (either via the R6 classes' fields `learner_args` or `parameter_grid`), this expression is evaluated before passing the argument list on the fitting functions. In order to execute the parameter tuning, the created objects need to be assigned to the corresponding fields of the R6 class `mlexperiments::MLTuneParameters`: ```{r} knn_tune_bayesian <- mlexperiments::MLTuneParameters$new( learner = LearnerKnn$new(), strategy = "bayesian", ncores = ncores, seed = seed ) knn_tune_bayesian$parameter_bounds <- knn_bounds knn_tune_bayesian$parameter_grid <- param_list_knn knn_tune_bayesian$split_type <- "stratified" knn_tune_bayesian$optim_args <- optim_args # set data knn_tune_bayesian$set_data( x = train_x, y = train_y ) results <- knn_tune_bayesian$execute(k = 3) #> #> Registering parallel backend using 4 cores. head(results) #> Epoch setting_id k gpUtility acqOptimum inBounds Elapsed Score metric_optim_mean errorMessage l #> 1: 0 1 4 NA FALSE TRUE 2.153 -0.2247332 0.2247332 NA 0 #> 2: 0 2 12 NA FALSE TRUE 2.274 -0.1600753 0.1600753 NA 0 #> 3: 0 3 20 NA FALSE TRUE 2.006 -0.1381042 0.1381042 NA 0 #> 4: 0 4 28 NA FALSE TRUE 2.329 -0.1403013 0.1403013 NA 0 #> 5: 0 5 36 NA FALSE TRUE 2.109 -0.1315129 0.1315129 NA 0 #> 6: 0 6 44 NA FALSE TRUE 2.166 -0.1258632 0.1258632 NA 0 ``` #### Grid Search To carry out the hyperparameter optimization with a grid search, only the `parameter_grid` is required: ```{r} knn_tune_grid <- mlexperiments::MLTuneParameters$new( learner = LearnerKnn$new(), strategy = "grid", ncores = ncores, seed = seed ) knn_tune_grid$parameter_grid <- param_list_knn knn_tune_grid$split_type <- "stratified" # set data knn_tune_grid$set_data( x = train_x, y = train_y ) results <- knn_tune_grid$execute(k = 3) #> #> Parameter settings [=====================>---------------------------------------------------------------------------] 2/9 ( 22%) #> Parameter settings [===============================>-----------------------------------------------------------------] 3/9 ( 33%) #> Parameter settings [==========================================>------------------------------------------------------] 4/9 ( 44%) #> Parameter settings [=====================================================>-------------------------------------------] 5/9 ( 56%) #> Parameter settings [================================================================>--------------------------------] 6/9 ( 67%) #> Parameter settings [==========================================================================>----------------------] 7/9 ( 78%) #> Parameter settings [=====================================================================================>-----------] 8/9 ( 89%) #> Parameter settings [=================================================================================================] 9/9 (100%) head(results) #> setting_id metric_optim_mean k l #> 1: 1 0.2187696 4 0 #> 2: 2 0.1597615 12 0 #> 3: 3 0.1349655 20 0 #> 4: 4 0.1406152 28 0 #> 5: 5 0.1318267 36 0 #> 6: 6 0.1258632 44 0 ``` ### Cross-Validation For the cross-validation experiments (`mlexperiments::MLCrossValidation`, and `mlexperiments::MLNestedCV`), a named list with the in-sample row indices of the folds is required. ```{r} fold_list <- splitTools::create_folds( y = train_y, k = 3, type = "stratified", seed = seed ) str(fold_list) #> List of 3 #> $ Fold1: int [1:2124] 1 2 3 4 5 7 9 10 11 12 ... #> $ Fold2: int [1:2124] 1 2 3 6 8 9 11 13 16 17 ... #> $ Fold3: int [1:2124] 4 5 6 7 8 10 12 14 15 16 ... ``` Furthermore, a specific hyperparameter setting that should be validated with the cross-validation needs to be selected: ```{r} knn_cv <- mlexperiments::MLCrossValidation$new( learner = LearnerKnn$new(), fold_list = fold_list, seed = seed ) best_grid_result <- knn_tune_grid$results$best.setting best_grid_result #> $setting_id #> [1] 9 #> #> $k #> [1] 68 #> #> $l #> [1] 0 #> #> $test #> expression(fold_test$x) knn_cv$learner_args <- best_grid_result[-1] knn_cv$predict_args <- list(type = "response") knn_cv$performance_metric <- metric("bacc") knn_cv$return_models <- TRUE # set data knn_cv$set_data( x = train_x, y = train_y ) results <- knn_cv$execute() #> #> CV fold: Fold1 #> #> CV fold: Fold2 #> CV progress [====================================================================>-----------------------------------] 2/3 ( 67%) #> #> CV fold: Fold3 #> CV progress [========================================================================================================] 3/3 (100%) #> head(results) #> fold performance k l #> 1: Fold1 0.8912781 68 0 #> 2: Fold2 0.8832388 68 0 #> 3: Fold3 0.8657147 68 0 ``` ### Nested Cross-Validation Last but not least, the hyperparameter optimization and validation can be combined in a nested cross-validation. In each fold of the so-called "outer" cross-validation loop, the hyperparameters are optimized on the in-sample observations with one of the two strategies: Bayesian optimization or grid search. Both of these strategies are implemented again with a "nested" ("inner") cross-validation. The best hyperparameter setting as identified by the inner cross-validation is then used to fit a model with all in-sample observations of the outer cross-validation loop and finally validate it on the respective out-sample observations. The experiment classes must be parameterized as described above. #### Inner Bayesian Optimization ```{r} knn_cv_nested_bayesian <- mlexperiments::MLNestedCV$new( learner = LearnerKnn$new(), strategy = "bayesian", fold_list = fold_list, k_tuning = 3L, ncores = ncores, seed = seed ) knn_cv_nested_bayesian$parameter_grid <- param_list_knn knn_cv_nested_bayesian$parameter_bounds <- knn_bounds knn_cv_nested_bayesian$split_type <- "stratified" knn_cv_nested_bayesian$optim_args <- optim_args knn_cv_nested_bayesian$predict_args <- list(type = "response") knn_cv_nested_bayesian$performance_metric <- metric("bacc") # set data knn_cv_nested_bayesian$set_data( x = train_x, y = train_y ) results <- knn_cv_nested_bayesian$execute() #> #> CV fold: Fold1 #> #> Registering parallel backend using 4 cores. #> #> CV fold: Fold2 #> CV progress [====================================================================>-----------------------------------] 2/3 ( 67%) #> #> Registering parallel backend using 4 cores. #> #> CV fold: Fold3 #> CV progress [========================================================================================================] 3/3 (100%) #> #> Registering parallel backend using 4 cores. head(results) #> fold performance k l #> 1: Fold1 0.8912781 68 0 #> 2: Fold2 0.8832388 68 0 #> 3: Fold3 0.8657147 68 0 ``` #### Inner Grid Search ```{r} knn_cv_nested_grid <- mlexperiments::MLNestedCV$new( learner = LearnerKnn$new(), strategy = "grid", fold_list = fold_list, k_tuning = 3L, ncores = ncores, seed = seed ) knn_cv_nested_grid$parameter_grid <- param_list_knn knn_cv_nested_grid$split_type <- "stratified" knn_cv_nested_grid$predict_args <- list(type = "response") knn_cv_nested_grid$performance_metric <- metric("bacc") # set data knn_cv_nested_grid$set_data( x = train_x, y = train_y ) results <- knn_cv_nested_grid$execute() #> #> CV fold: Fold1 #> #> Parameter settings [=====================>---------------------------------------------------------------------------] 2/9 ( 22%) #> Parameter settings [===============================>-----------------------------------------------------------------] 3/9 ( 33%) #> Parameter settings [==========================================>------------------------------------------------------] 4/9 ( 44%) #> Parameter settings [=====================================================>-------------------------------------------] 5/9 ( 56%) #> Parameter settings [================================================================>--------------------------------] 6/9 ( 67%) #> Parameter settings [==========================================================================>----------------------] 7/9 ( 78%) #> Parameter settings [=====================================================================================>-----------] 8/9 ( 89%) #> Parameter settings [=================================================================================================] 9/9 (100%) #> CV fold: Fold2 #> CV progress [====================================================================>-----------------------------------] 2/3 ( 67%) #> #> Parameter settings [=====================>---------------------------------------------------------------------------] 2/9 ( 22%) #> Parameter settings [===============================>-----------------------------------------------------------------] 3/9 ( 33%) #> Parameter settings [==========================================>------------------------------------------------------] 4/9 ( 44%) #> Parameter settings [=====================================================>-------------------------------------------] 5/9 ( 56%) #> Parameter settings [================================================================>--------------------------------] 6/9 ( 67%) #> Parameter settings [==========================================================================>----------------------] 7/9 ( 78%) #> Parameter settings [=====================================================================================>-----------] 8/9 ( 89%) #> Parameter settings [=================================================================================================] 9/9 (100%) #> CV fold: Fold3 #> CV progress [========================================================================================================] 3/3 (100%) #> #> Parameter settings [=====================>---------------------------------------------------------------------------] 2/9 ( 22%) #> Parameter settings [===============================>-----------------------------------------------------------------] 3/9 ( 33%) #> Parameter settings [==========================================>------------------------------------------------------] 4/9 ( 44%) #> Parameter settings [=====================================================>-------------------------------------------] 5/9 ( 56%) #> Parameter settings [================================================================>--------------------------------] 6/9 ( 67%) #> Parameter settings [==========================================================================>----------------------] 7/9 ( 78%) #> Parameter settings [=====================================================================================>-----------] 8/9 ( 89%) #> Parameter settings [=================================================================================================] 9/9 (100%) head(results) #> fold performance k l #> 1: Fold1 0.8959736 52 0 #> 2: Fold2 0.8832388 68 0 #> 3: Fold3 0.8657147 68 0 ``` ```{r include=FALSE} # nolint end ```