This vignette will walk through an example of scaling a
modeltuning analysis with AWS via the Paws SDK. The following
analysis depends on AWS credentials passed via environment variables. To
see a detailed outline of the different ways to set AWS credentials,
check out this
how-to document.
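For example, credentials can be set from within R before any AWS calls are made (the values below are placeholders, and the region is just an arbitrary example):

# Standard AWS environment variables picked up by paws
Sys.setenv(
  AWS_ACCESS_KEY_ID     = "<your-access-key-id>",
  AWS_SECRET_ACCESS_KEY = "<your-secret-access-key>",
  AWS_REGION            = "us-east-1"
)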
We’ll be training a classification model on the iris
dataset to predict whether a flower’s species is virginica or not.
First, let’s generate a bunch of synthetic observations by
adding random noise to the original iris features and
combining them into one big data frame.
# Stack 10 jittered copies of iris and recode Species as a
# binary factor (virginica vs. not)
iris_new <- do.call(
what = rbind,
args = replicate(n = 10, iris, simplify = FALSE)
) |>
transform(
Sepal.Length = jitter(Sepal.Length, 0.1),
Sepal.Width = jitter(Sepal.Width, 0.1),
Petal.Length = jitter(Petal.Length, 0.1),
Petal.Width = jitter(Petal.Width, 0.1),
Species = factor(Species == "virginica")
)
# Shuffle the data-set
iris_new <- iris_new[sample(1:nrow(iris_new), nrow(iris_new)), ]
# Quick overview of the dataset
summary(iris_new[, 1:4])
# Sepal.Length Sepal.Width Petal.Length Petal.Width
# Min. :4.298 Min. :1.998 Min. :0.9981 Min. :0.09805
# 1st Qu.:5.100 1st Qu.:2.799 1st Qu.:1.5982 1st Qu.:0.29973
# Median :5.799 Median :3.001 Median :4.3497 Median :1.30097
# Mean :5.843 Mean :3.057 Mean :3.7580 Mean :1.19928
# 3rd Qu.:6.401 3rd Qu.:3.302 3rd Qu.:5.1002 3rd Qu.:1.80084
# Max.   :7.902   Max.   :4.402   Max.   :6.9015   Max.   :2.50193

Now that we’ve got the data prepped, let’s specify our predictive
modeling approach. For this analysis I’m going to train a Support Vector
classifier using the e1071 package, and I’m going to use
Grid Search in combination with 5-fold Cross-Validation to find the
optimal values for the cost and kernel
hyper-parameters. The candidate models are scored with accuracy,
F-measure, and AUC (the *_vec metric functions come from the
yardstick package).
# Create a splitter function that will return CV folds
# (vfold_cv() comes from the rsample package)
splitter_fn <- function(data) lapply(vfold_cv(data, v = 5)$splits, \(y) y$in_id)
iris_grid <- GridSearchCV$new(
learner = svm,
tune_params = list(
cost = c(0.01, 0.1, 0.5, 1, 3, 6),
kernel = c("polynomial", "radial", "sigmoid")
),
learner_args = list(
scale = TRUE,
type = "C-classification",
probability = TRUE
),
splitter = splitter_fn,
scorer = list(
accuracy = accuracy_vec,
f_measure = f_meas_vec,
auc = roc_auc_vec
),
prediction_args = list(
accuracy = NULL,
f_measure = NULL,
auc = list(probability = TRUE)
),
convert_predictions = list(
accuracy = NULL,
f_measure = NULL,
auc = function(.x) attr(.x, "probabilities")[, "FALSE"]
),
optimize_score = "max"
)

Now that we’ve specified our Grid Search schema, let’s check out the hyper-parameter grid and see how many models we’re going to estimate.
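As a quick sanity check, we can count the candidates directly from the values we supplied above (this assumes the tuning grid is the full Cartesian product of the cost and kernel values):

# 6 cost values x 3 kernels = 18 hyper-parameter combinations
cost_values   <- c(0.01, 0.1, 0.5, 1, 3, 6)
kernel_values <- c("polynomial", "radial", "sigmoid")
length(cost_values) * length(kernel_values)
# [1] 18
# Each combination is estimated across 5 CV folds: 18 * 5 = 90 model fits
length(cost_values) * length(kernel_values) * 5
# [1] 90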
To speed things up, let’s create a remote cluster of 6 worker nodes and estimate the models in parallel.
First, we will launch 6 instances using a custom AMI that contains R and a bunch of essential R packages. While this particular AMI is not available as a community AMI, there are definitely good AMIs out there that come with a comprehensive set of R packages and corresponding tools installed. In the call below, key_name and security_group refer to an existing EC2 key pair and security group in your account. Note: which parameters you need to specify when launching EC2 instances may vary greatly depending on your account’s security configurations.
# Create an EC2 client (ec2() comes from the paws package)
ec2_client <- ec2()
# Request Instances
instance_req <- ec2_client$run_instances(
ImageId = "ami-06dd49fc9e3a5acee",
InstanceType = "t2.large",
KeyName = key_name,
MaxCount = 6,
MinCount = 6,
InstanceInitiatedShutdownBehavior = "terminate",
SecurityGroupIds = security_group,
# This names the instances
TagSpecifications = list(
list(
ResourceType = "instance",
Tags = list(
list(
Key = "Name",
Value = "Worker Node"
)
)
)
)
)

Now that we’ve launched the instances, we need to wait until they all
respond as "running" before we try to do anything (we also need to
wait roughly a minute for the instances to initialize, or they’ll
reject our SSH login attempts).
# Chalk up a quick function to return instance IDs from our request
instance_ids <- function(response) {
vapply(response$Instances, function(i) i$InstanceId, character(1))
}
# Wait for instances to all respond as 'running'
while(
!all(
vapply(
ec2_client$
describe_instances(InstanceIds = instance_ids(instance_req))$
Reservations[[1]]$
Instances,
function(i) i$State$Name,
character(1)
) == "running"
)
) {
Sys.sleep(5)
}
# Rough heuristic -- give additional 45 seconds for instances to initialize
Sys.sleep(45)

Now, in order to set up our compute cluster, we need to get the IP addresses of these instances.
# Get public IPs
inst_public_ips <- vapply(
ec2_client$
describe_instances(InstanceIds = instance_ids(instance_req))$
Reservations[[1]]$
Instances,
function(i) i$PublicIpAddress,
character(1)
)

Finally, we can create a compute cluster on these worker nodes via SSH.
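A minimal sketch of one way to do this is with parallelly::makeClusterPSOCK(); the user name and key file below are assumptions that depend on your AMI and on the key pair referenced by key_name:

library(parallelly)

cl <- makeClusterPSOCK(
  workers = inst_public_ips,            # public IPs collected above
  user    = "ubuntu",                   # assumed default user for the AMI
  rshopts = c(
    "-o", "StrictHostKeyChecking=no",   # fresh instances won't be in known_hosts yet
    "-i", "worker_key.pem"              # assumed path to the key pair's private key
  ),
  verbose = TRUE
)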
Now that we’ve created our compute cluster, we can use the
future package to specify our parallelization plan. Since
modeltuning is built on top of the future
framework, it will automatically parallelize the model estimation across
our 6-worker cluster. The following parallelization topology tells
future to parallelize the grid-search models across the compute
cluster, and to parallelize each model’s cross-validation across the
cores of the instance it is evaluated on.
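A minimal sketch of such a nested plan, using the cluster object cl created above:

library(future)

# Outer layer: distribute grid-search models across the worker nodes;
# inner layer: run each model's cross-validation on the cores of its node
plan(list(
  tweak(cluster, workers = cl),
  multisession
))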
Finally, let’s estimate our Grid Search models in parallel!
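The fitting call itself isn’t reproduced in this excerpt. As a purely hypothetical sketch (the fit() method name and its arguments are assumptions; check the modeltuning documentation for the exact interface), it would look something like this, producing the iris_grid_fitted object used below:

# Hypothetical call -- method name and arguments are assumed, not confirmed
iris_grid_fitted <- iris_grid$fit(
  formula = Species ~ .,
  data    = iris_new
)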
Let’s check out the info on our best model.
best_idx <- iris_grid_fitted$best_idx
metrics <- iris_grid_fitted$metrics
# Print model metrics of best model
cat(
" Accuracy:", round(100 * metrics$accuracy[[best_idx]], 2),
"%\nF-Measure:", round(100 * metrics$f_measure[[best_idx]], 2),
"%\n AUC:", round(metrics$auc[[best_idx]], 4), "\n"
)
# Accuracy: 98.4 %
# F-Measure: 98.8 %
# AUC: 0.9993
params <- iris_grid_fitted$best_params
# Print the best hyper-parameters
cat(
" Optimal Cost:", params[["cost"]],
"\nOptimal Kernel:", params[["kernel"]], "\n"
)
# Optimal Cost: 6
# Optimal Kernel: polynomial