--- title: "Scaling with AWS" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Scaling with AWS} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set(comment = "#", collapse = TRUE, eval = TRUE, echo = TRUE, warning = FALSE, message = FALSE) ``` This vignette will walk through an example of scaling a `modeltuning` analysis with AWS via the [Paws SDK](https://github.com/paws-r/paws). The following analysis depends on AWS credentials passed via environment variables. To see a detailed outline of the different ways to set AWS credentials, check out [this how-to document](https://github.com/paws-r/paws/blob/main/docs/credentials.md). # Requisite Packages ```{r Requisite Packages} library(e1071) library(future) library(modeltuning) # devtools::install_github("dmolitor/modeltuning") library(parallelly) library(paws) library(rsample) library(yardstick) ``` # Data and Model Prep ## Data Prep We'll be training a classification model on the `iris` data-set to predict whether a flower's species is virginica or not. First, let's generate a bunch of synthetic data observations by adding random noise to the original `iris` features and combining it into one big dataframe. ```{r Iris Big} iris_new <- do.call( what = rbind, args = replicate(n = 10, iris, simplify = FALSE) ) |> transform( Sepal.Length = jitter(Sepal.Length, 0.1), Sepal.Width = jitter(Sepal.Width, 0.1), Petal.Length = jitter(Petal.Length, 0.1), Petal.Width = jitter(Petal.Width, 0.1), Species = factor(Species == "virginica") ) # Shuffle the data-set iris_new <- iris_new[sample(1:nrow(iris_new), nrow(iris_new)), ] # Quick overview of the dataset summary(iris_new[, 1:4]) ``` ## Grid Search Specification Now that we've got the data prepped, let's specify our predictive modeling approach. For this analysis I'm going to train a Support Vector classifier using the `e1071` package, and I'm going to use Grid Search in combination with 5-fold Cross-Validation to find the optimal values for the `cost` and `kernel` hyper-parameters. ```{r Grid Search} # Create a splitter function that will return CV folds splitter_fn <- function(data) lapply(vfold_cv(data, v = 5)$splits, \(y) y$in_id) iris_grid <- GridSearchCV$new( learner = svm, tune_params = list( cost = c(0.01, 0.1, 0.5, 1, 3, 6), kernel = c("polynomial", "radial", "sigmoid") ), learner_args = list( scale = TRUE, type = "C-classification", probability = TRUE ), splitter = splitter_fn, scorer = list( accuracy = accuracy_vec, f_measure = f_meas_vec, auc = roc_auc_vec ), prediction_args = list( accuracy = NULL, f_measure = NULL, auc = list(probability = TRUE) ), convert_predictions = list( accuracy = NULL, f_measure = NULL, auc = function(.x) attr(.x, "probabilities")[, "FALSE"] ), optimize_score = "max" ) ``` Now that we've specified our Grid Search schema let's check out the hyper-parameter grid and see how many models we're going to estimate. ```{r N-Models} cat("We will estimate", nrow(iris_grid$tune_params), "SVM models\n") ``` # Launch AWS Resources To speed up the estimation of our models, let's create a remote cluster of 6 worker nodes to estimate the models in parallel. ## Launch EC2 Instances First, we will launch 6 instances using a custom AMI that contains R and a bunch of essential R packages. While this AMI is not available as a community AMI there are definitely good AMIs out there that have a comprehensive set of R packages and corresponding tools installed. 
**Note:** which parameters you need to specify when launching EC2 instances may vary greatly depending on your account's security configurations.

```{r Launch Instances, eval = FALSE}
ec2_client <- ec2()

# Request instances
instance_req <- ec2_client$run_instances(
  ImageId = "ami-06dd49fc9e3a5acee",
  InstanceType = "t2.large",
  KeyName = key_name,                # Name of your EC2 key pair
  MaxCount = 6,
  MinCount = 6,
  InstanceInitiatedShutdownBehavior = "terminate",
  SecurityGroupIds = security_group, # Your security group ID(s)
  # This names the instances
  TagSpecifications = list(
    list(
      ResourceType = "instance",
      Tags = list(
        list(
          Key = "Name",
          Value = "Worker Node"
        )
      )
    )
  )
)
```

Now that we've launched the instances, we need to wait until they all respond as `"running"` before we try to do anything. We also need to give them a little extra time (~45 seconds) to initialize, or they'll reject our SSH login attempts.

```{r Wait for Instances, eval = FALSE}
# Chalk up a quick function to return instance IDs from our request
instance_ids <- function(response) {
  vapply(response$Instances, function(i) i$InstanceId, character(1))
}

# Wait for instances to all respond as 'running'
while (
  !all(
    vapply(
      ec2_client$
        describe_instances(InstanceIds = instance_ids(instance_req))$
        Reservations[[1]]$
        Instances,
      function(i) i$State$Name,
      character(1)
    ) == "running"
  )
) {
  Sys.sleep(5)
}

# Rough heuristic -- give an additional 45 seconds for instances to initialize
Sys.sleep(45)
```

## Create Cluster

Now, in order to set up our compute cluster, we need to get the public IP addresses of these instances.

```{r IPs, eval = FALSE}
# Get public IPs
inst_public_ips <- vapply(
  ec2_client$
    describe_instances(InstanceIds = instance_ids(instance_req))$
    Reservations[[1]]$
    Instances,
  function(i) i$PublicIpAddress,
  character(1)
)
```

Finally, we can create a compute cluster on these worker nodes via SSH.

```{r Compute Cluster, eval = FALSE}
cl <- makeClusterPSOCK(
  workers = inst_public_ips,
  user = "ubuntu",
  rshopts = c(
    "-o", "StrictHostKeyChecking=no",
    "-o", "IdentitiesOnly=yes",
    "-i", pem_fp # Local filepath to private SSH key-pair
  ),
  connectTimeout = 25,
  tries = 3
)
```

# Estimate Models

Now that we've created our compute cluster, we can use the `future` package to specify our parallelization plan. Since `modeltuning` is built on top of the `future` framework, it will automatically parallelize the model estimation across our 6-worker cluster. The following parallelization __topology__ tells `future` to parallelize the grid-search models across the compute cluster, and to parallelize each model's cross-validation across the cores of the instance it is being evaluated on.

```{r Parallel plan, eval = FALSE}
plan(
  list(
    tweak(cluster, workers = cl),
    multisession
  )
)
```

Finally, let's estimate our Grid Search models in parallel!

```{r Estimate Models}
iris_grid_fitted <- iris_grid$fit(
  formula = Species ~ .,
  data = iris_new,
  progress = TRUE
)
```

# Best Model/Parameters

Let's check out the info on our best model.

```{r Best Model Info}
best_idx <- iris_grid_fitted$best_idx
metrics <- iris_grid_fitted$metrics

# Print model metrics of best model
cat(
  " Accuracy:", round(100 * metrics$accuracy[[best_idx]], 2),
  "%\nF-Measure:", round(100 * metrics$f_measure[[best_idx]], 2),
  "%\n      AUC:", round(metrics$auc[[best_idx]], 4), "\n"
)

params <- iris_grid_fitted$best_params

# Print the best hyper-parameters
cat(
  "  Optimal Cost:", params[["cost"]],
  "\nOptimal Kernel:", params[["kernel"]], "\n"
)
```
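
With the best hyper-parameters in hand, a natural next step is to refit a final model on the full data-set. This refit isn't part of the grid-search object itself; the sketch below simply passes the selected `cost` and `kernel` values straight to `e1071::svm()` with the same `learner_args` we specified earlier.

```{r Final Model, eval = FALSE}
# Refit an SVM on the full data-set using the selected hyper-parameters
final_model <- svm(
  Species ~ .,
  data = iris_new,
  cost = params[["cost"]],
  kernel = params[["kernel"]],
  scale = TRUE,
  type = "C-classification",
  probability = TRUE
)

# In-sample predictions from the final model
head(predict(final_model, newdata = iris_new))
```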
# Kill AWS Resources

Now that we've completed our mini-analysis, let's make sure to kill all our AWS resources. Since all we've done is launch EC2 instances, this just means making sure the instances are all shut down.

```{r Kill Instances, eval = FALSE}
ec2_client$stop_instances(
  InstanceIds = instance_ids(instance_req)
)
```
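
It's also good hygiene to close the SSH cluster connections before the instances go away, and, if you'd rather remove the instances entirely instead of just stopping them, to terminate them and confirm the state change. A rough sketch:

```{r Terminate Instances, eval = FALSE}
# Close the PSOCK cluster connections cleanly
parallel::stopCluster(cl)

# Terminate the instances outright (rather than just stopping them)
ec2_client$terminate_instances(
  InstanceIds = instance_ids(instance_req)
)

# Confirm that every instance reports 'shutting-down' or 'terminated'
vapply(
  ec2_client$
    describe_instances(InstanceIds = instance_ids(instance_req))$
    Reservations[[1]]$
    Instances,
  function(i) i$State$Name,
  character(1)
)
```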