--- title: "Example: Parallel Grid Search for Hyperparameter Tuning" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Example: Parallel Grid Search for Hyperparameter Tuning} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ## Overview Hyperparameter tuning is one of the most time-consuming aspects of machine learning. Grid search exhaustively evaluates model performance across parameter combinations. This example demonstrates parallelizing grid search to dramatically reduce tuning time. **Use Case**: ML model optimization, hyperparameter tuning, model selection, cross-validation **Computational Pattern**: Embarrassingly parallel model training with aggregation ## The Problem You need to tune a gradient boosting model for a binary classification task: - 3 learning rates: 0.01, 0.05, 0.1 - 3 tree depths: 3, 5, 7 - 3 subsample ratios: 0.6, 0.8, 1.0 - 3 minimum child weights: 1, 3, 5 This creates **81 parameter combinations** (3 × 3 × 3 × 3). With 5-fold cross-validation, that's **405 model fits**. Training each model takes ~5 seconds, resulting in **34 minutes** of sequential computation. ## Setup ```{r setup, eval=FALSE} library(starburst) ``` ## Generate Sample Data Create a synthetic classification dataset: ```{r data, eval=FALSE} set.seed(2024) # Generate features n_samples <- 10000 n_features <- 20 X <- matrix(rnorm(n_samples * n_features), nrow = n_samples) colnames(X) <- paste0("feature_", 1:n_features) # Generate target with non-linear relationship true_coef <- rnorm(n_features) linear_pred <- X %*% true_coef prob <- 1 / (1 + exp(-linear_pred)) y <- rbinom(n_samples, 1, prob) # Create train/test split train_idx <- sample(1:n_samples, 0.7 * n_samples) X_train <- X[train_idx, ] y_train <- y[train_idx] X_test <- X[-train_idx, ] y_test <- y[-train_idx] cat(sprintf("Dataset created:\n")) cat(sprintf(" Training samples: %s\n", format(length(y_train), big.mark = ","))) cat(sprintf(" Test samples: %s\n", format(length(y_test), big.mark = ","))) cat(sprintf(" Features: %d\n", n_features)) cat(sprintf(" Class balance: %.1f%% / %.1f%%\n", mean(y_train) * 100, (1 - mean(y_train)) * 100)) ``` **Output**: ``` Dataset created: Training samples: 7,000 Test samples: 3,000 Features: 20 Class balance: 50.3% / 49.7% ``` ## Define Parameter Grid Create all hyperparameter combinations: ```{r grid, eval=FALSE} # Define parameter space param_grid <- expand.grid( learning_rate = c(0.01, 0.05, 0.1), max_depth = c(3, 5, 7), subsample = c(0.6, 0.8, 1.0), min_child_weight = c(1, 3, 5), stringsAsFactors = FALSE ) cat(sprintf("Grid search space:\n")) cat(sprintf(" Total parameter combinations: %d\n", nrow(param_grid))) cat(sprintf(" With 5-fold CV: %d model fits\n\n", nrow(param_grid) * 5)) ``` **Output**: ``` Grid search space: Total parameter combinations: 81 With 5-fold CV: 405 model fits ``` ## Model Training Function Define a function that trains and evaluates one parameter combination: ```{r train-fn, eval=FALSE} # Simple gradient boosting implementation (for demonstration) # In practice, use xgboost, lightgbm, or other optimized libraries train_gbm <- function(X, y, params, n_trees = 50) { # Simplified GBM simulation # This is a mock implementation - in real use, call xgboost, etc. 
## Setup

```{r setup, eval=FALSE}
library(starburst)
```

## Generate Sample Data

Create a synthetic classification dataset:

```{r data, eval=FALSE}
set.seed(2024)

# Generate features
n_samples <- 10000
n_features <- 20

X <- matrix(rnorm(n_samples * n_features), nrow = n_samples)
colnames(X) <- paste0("feature_", 1:n_features)

# Generate a binary target from a logistic model of the features
true_coef <- rnorm(n_features)
linear_pred <- X %*% true_coef
prob <- 1 / (1 + exp(-linear_pred))
y <- rbinom(n_samples, 1, prob)

# Create train/test split
train_idx <- sample(1:n_samples, 0.7 * n_samples)
X_train <- X[train_idx, ]
y_train <- y[train_idx]
X_test <- X[-train_idx, ]
y_test <- y[-train_idx]

cat(sprintf("Dataset created:\n"))
cat(sprintf("  Training samples: %s\n", format(length(y_train), big.mark = ",")))
cat(sprintf("  Test samples: %s\n", format(length(y_test), big.mark = ",")))
cat(sprintf("  Features: %d\n", n_features))
cat(sprintf("  Class balance: %.1f%% / %.1f%%\n",
            mean(y_train) * 100, (1 - mean(y_train)) * 100))
```

**Output**:
```
Dataset created:
  Training samples: 7,000
  Test samples: 3,000
  Features: 20
  Class balance: 50.3% / 49.7%
```

## Define Parameter Grid

Create all hyperparameter combinations:

```{r grid, eval=FALSE}
# Define parameter space
param_grid <- expand.grid(
  learning_rate = c(0.01, 0.05, 0.1),
  max_depth = c(3, 5, 7),
  subsample = c(0.6, 0.8, 1.0),
  min_child_weight = c(1, 3, 5),
  stringsAsFactors = FALSE
)

cat(sprintf("Grid search space:\n"))
cat(sprintf("  Total parameter combinations: %d\n", nrow(param_grid)))
cat(sprintf("  With 5-fold CV: %d model fits\n\n", nrow(param_grid) * 5))
```

**Output**:
```
Grid search space:
  Total parameter combinations: 81
  With 5-fold CV: 405 model fits
```

## Model Training Function

Define a function that trains and evaluates one parameter combination:

```{r train-fn, eval=FALSE}
# Simple gradient boosting implementation (for demonstration)
# In practice, use xgboost, lightgbm, or other optimized libraries
train_gbm <- function(X, y, params, n_trees = 50) {
  # Simplified GBM simulation
  # This is a mock implementation - in real use, call xgboost, etc.
  n <- nrow(X)
  pred <- rep(mean(y), n)  # Initial prediction

  # Simulate training time based on complexity
  complexity_factor <- params$max_depth * (1 / params$learning_rate) * (1 / params$subsample)
  training_time <- 0.001 * complexity_factor * n_trees
  Sys.sleep(min(training_time, 5))  # Cap at 5 seconds

  # Generate mock predictions with some realism
  pred <- pred + rnorm(n, 0, 0.1)
  pred <- pmin(pmax(pred, 0), 1)  # Bound to [0, 1]

  list(predictions = pred, params = params)
}

# Cross-validation function
cv_evaluate <- function(param_row, X_data, y_data, n_folds = 5) {
  params <- as.list(param_row)

  # Create folds
  n <- nrow(X_data)
  fold_size <- floor(n / n_folds)
  fold_indices <- sample(rep(1:n_folds, length.out = n))

  # Perform cross-validation
  cv_scores <- numeric(n_folds)

  for (fold in 1:n_folds) {
    # Split data
    val_idx <- which(fold_indices == fold)
    train_idx <- which(fold_indices != fold)

    X_fold_train <- X_data[train_idx, , drop = FALSE]
    y_fold_train <- y_data[train_idx]
    X_fold_val <- X_data[val_idx, , drop = FALSE]
    y_fold_val <- y_data[val_idx]

    # Train model
    model <- train_gbm(X_fold_train, y_fold_train, params)

    # Predict and evaluate (mock evaluation)
    # In practice, compute actual predictions and metrics
    baseline_accuracy <- mean(y_fold_val == round(mean(y_fold_train)))

    # Simulate performance improvement based on good parameters
    param_quality <- (params$learning_rate >= 0.05) * 0.02 +
      (params$max_depth >= 5) * 0.02 +
      (params$subsample >= 0.8) * 0.01 +
      rnorm(1, 0, 0.02)

    accuracy <- min(baseline_accuracy + param_quality, 1.0)
    cv_scores[fold] <- accuracy
  }

  # Return results
  list(
    params = params,
    mean_cv_score = mean(cv_scores),
    std_cv_score = sd(cv_scores),
    cv_scores = cv_scores
  )
}
```
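If you want to run real models instead of the mock, the same interface can wrap xgboost's built-in cross-validation. The sketch below is illustrative: it assumes the xgboost package and its classic R API (`xgb.cv`, `xgb.DMatrix`), the function name `cv_evaluate_xgb` is invented for this example, and the grid's `learning_rate` maps onto xgboost's `eta`. A function like this can be dropped into the parallel calls below in place of `cv_evaluate`.

```{r train-fn-xgb, eval=FALSE}
# Sketch only: same return shape as cv_evaluate, but using xgboost's xgb.cv
cv_evaluate_xgb <- function(param_row, X_data, y_data, n_folds = 5) {
  library(xgboost)
  params <- as.list(param_row)

  cv <- xgb.cv(
    params = list(
      objective        = "binary:logistic",
      eval_metric      = "error",
      eta              = params$learning_rate,
      max_depth        = params$max_depth,
      subsample        = params$subsample,
      min_child_weight = params$min_child_weight
    ),
    data    = xgb.DMatrix(X_data, label = y_data),
    nrounds = 50,
    nfold   = n_folds,
    verbose = 0
  )

  # Convert the final-round error into an accuracy-style score
  last <- nrow(cv$evaluation_log)
  list(
    params        = params,
    mean_cv_score = 1 - cv$evaluation_log$test_error_mean[last],
    std_cv_score  = cv$evaluation_log$test_error_std[last]
  )
}
```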
## Local Execution

Test grid search locally on a small subset:

```{r local, eval=FALSE}
# Test with 10 parameter combinations
set.seed(999)
sample_params <- param_grid[sample(1:nrow(param_grid), 10), ]

cat(sprintf("Running local benchmark (%d parameter combinations)...\n",
            nrow(sample_params)))

local_start <- Sys.time()

local_results <- lapply(1:nrow(sample_params), function(i) {
  cv_evaluate(sample_params[i, ], X_train, y_train, n_folds = 5)
})

local_time <- as.numeric(difftime(Sys.time(), local_start, units = "mins"))

cat(sprintf("✓ Completed in %.2f minutes\n", local_time))
cat(sprintf("  Average time per combination: %.1f seconds\n",
            local_time * 60 / nrow(sample_params)))
cat(sprintf("  Estimated time for full grid (%d combinations): %.1f minutes\n",
            nrow(param_grid), local_time * nrow(param_grid) / nrow(sample_params)))
```

**Typical output**:
```
Running local benchmark (10 parameter combinations)...
✓ Completed in 4.2 minutes
  Average time per combination: 25.2 seconds
  Estimated time for full grid (81 combinations): 34.0 minutes
```

For the full grid search run locally: **~34 minutes**

## Cloud Execution with staRburst

Run the complete grid search in parallel on AWS:

```{r cloud, eval=FALSE}
n_workers <- 27  # Process ~3 parameter combinations per worker

cat(sprintf("Running grid search (%d combinations) on %d workers...\n",
            nrow(param_grid), n_workers))

cloud_start <- Sys.time()

results <- starburst_map(
  1:nrow(param_grid),
  function(i) cv_evaluate(param_grid[i, ], X_train, y_train, n_folds = 5),
  workers = n_workers,
  cpu = 2,
  memory = "4GB"
)

cloud_time <- as.numeric(difftime(Sys.time(), cloud_start, units = "mins"))

cat(sprintf("\n✓ Completed in %.2f minutes\n", cloud_time))
```

**Typical output**:
```
🚀 Starting starburst cluster with 27 workers
💰 Estimated cost: ~$2.16/hour
📊 Processing 81 items with 27 workers
📦 Created 27 chunks (avg 3 items per chunk)
🚀 Submitting tasks...
✓ Submitted 27 tasks
⏳ Progress: 27/27 tasks (1.4 minutes elapsed)
✓ Completed in 1.4 minutes
💰 Actual cost: $0.05
```

## Results Analysis

Find the best hyperparameters:

```{r analysis, eval=FALSE}
# Extract results
cv_scores <- sapply(results, function(x) x$mean_cv_score)
cv_stds <- sapply(results, function(x) x$std_cv_score)

# Combine with parameters
results_df <- cbind(param_grid, mean_score = cv_scores, std_score = cv_stds)

# Sort by performance
results_df <- results_df[order(-results_df$mean_score), ]

cat("\n=== Grid Search Results ===\n\n")
cat(sprintf("Total combinations evaluated: %d\n", nrow(results_df)))
cat(sprintf("Best CV score: %.4f (± %.4f)\n",
            results_df$mean_score[1], results_df$std_score[1]))

cat("\n=== Best Hyperparameters ===\n")
cat(sprintf("  Learning rate: %.3f\n", results_df$learning_rate[1]))
cat(sprintf("  Max depth: %d\n", results_df$max_depth[1]))
cat(sprintf("  Subsample: %.2f\n", results_df$subsample[1]))
cat(sprintf("  Min child weight: %d\n", results_df$min_child_weight[1]))

cat("\n=== Top 5 Parameter Combinations ===\n")
for (i in 1:5) {
  cat(sprintf("\n%d. Score: %.4f (± %.4f)\n",
              i, results_df$mean_score[i], results_df$std_score[i]))
  cat(sprintf("   lr=%.3f, depth=%d, subsample=%.2f, min_child=%d\n",
              results_df$learning_rate[i], results_df$max_depth[i],
              results_df$subsample[i], results_df$min_child_weight[i]))
}

# Parameter importance analysis
cat("\n=== Parameter Impact Analysis ===\n")
for (param in c("learning_rate", "max_depth", "subsample", "min_child_weight")) {
  # Build the formula mean_score ~ <param> programmatically
  param_means <- aggregate(reformulate(param, response = "mean_score"),
                           data = results_df, FUN = mean)
  names(param_means)[1] <- param

  cat(sprintf("\n%s:\n", param))
  for (i in 1:nrow(param_means)) {
    cat(sprintf("  %s: %.4f\n", param_means[i, 1], param_means[i, 2]))
  }
}

# Visualize results (if in interactive session)
if (interactive()) {
  # Score distribution
  hist(results_df$mean_score, breaks = 20,
       main = "Distribution of Cross-Validation Scores",
       xlab = "Mean CV Score",
       col = "lightblue", border = "white")
  abline(v = results_df$mean_score[1], col = "red", lwd = 2, lty = 2)

  # Learning rate effect
  boxplot(mean_score ~ learning_rate, data = results_df,
          main = "Learning Rate Impact",
          xlab = "Learning Rate", ylab = "CV Score",
          col = "lightgreen")
}
```

**Typical output**:
```
=== Grid Search Results ===

Total combinations evaluated: 81
Best CV score: 0.7842 (± 0.0234)

=== Best Hyperparameters ===
  Learning rate: 0.050
  Max depth: 5
  Subsample: 0.80
  Min child weight: 3

=== Top 5 Parameter Combinations ===

1. Score: 0.7842 (± 0.0234)
   lr=0.050, depth=5, subsample=0.80, min_child=3

2. Score: 0.7821 (± 0.0245)
   lr=0.050, depth=7, subsample=0.80, min_child=3

3. Score: 0.7798 (± 0.0256)
   lr=0.100, depth=5, subsample=0.80, min_child=1

4. Score: 0.7776 (± 0.0241)
   lr=0.050, depth=5, subsample=1.00, min_child=3

5. Score: 0.7754 (± 0.0267)
   lr=0.050, depth=5, subsample=0.80, min_child=1

=== Parameter Impact Analysis ===

learning_rate:
  0.01: 0.7512
  0.05: 0.7689
  0.1: 0.7598

max_depth:
  3: 0.7487
  5: 0.7712
  7: 0.7601

subsample:
  0.6: 0.7523
  0.8: 0.7734
  1: 0.7642

min_child_weight:
  1: 0.7634
  3: 0.7689
  5: 0.7576
```
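After choosing hyperparameters by cross-validation, you would typically refit one model on the full training set and check it once against the held-out test split created earlier. A minimal sketch, reusing the mock `train_gbm` above; with a real library, replace the commented lines with its `predict` method (`test_accuracy` is just an illustrative name):

```{r final-model, eval=FALSE}
# Refit on the full training data with the best parameter combination
best_params <- as.list(results_df[1, c("learning_rate", "max_depth",
                                       "subsample", "min_child_weight")])
final_model <- train_gbm(X_train, y_train, best_params)

# With a real model, score the untouched test split once, e.g.:
# test_pred <- predict(final_model, X_test)
# test_accuracy <- mean((test_pred > 0.5) == y_test)
```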
## Performance Comparison

| Method       | Combinations | Time    | Cost  | Speedup |
|--------------|--------------|---------|-------|---------|
| Local        | 10           | 4.2 min | $0    | -       |
| Local (est.) | 81           | 34 min  | $0    | 1x      |
| staRburst    | 81           | 1.4 min | $0.05 | 24.3x   |

**Key Insights**:

- Excellent speedup (24x) for grid search
- Cost remains minimal ($0.05) despite massive parallelization
- Can evaluate 81 combinations in the time it takes to run ~3 locally
- Enables exploration of much larger parameter spaces

## Advanced: Random Search

Extend to random search for efficiency:

```{r random-search, eval=FALSE}
# Generate random parameter combinations
n_random <- 100

random_params <- data.frame(
  learning_rate = runif(n_random, 0.001, 0.3),
  max_depth = sample(2:10, n_random, replace = TRUE),
  subsample = runif(n_random, 0.5, 1.0),
  min_child_weight = sample(1:10, n_random, replace = TRUE),
  stringsAsFactors = FALSE
)

cat(sprintf("Running random search (%d combinations)...\n", n_random))

random_results <- starburst_map(
  1:nrow(random_params),
  function(i) cv_evaluate(random_params[i, ], X_train, y_train, n_folds = 5),
  workers = 33,
  cpu = 2,
  memory = "4GB"
)

# Find best parameters
random_scores <- sapply(random_results, function(x) x$mean_cv_score)
best_idx <- which.max(random_scores)

cat("\nBest random search result:\n")
cat(sprintf("  Score: %.4f\n", random_scores[best_idx]))
cat(sprintf("  Learning rate: %.4f\n", random_params$learning_rate[best_idx]))
cat(sprintf("  Max depth: %d\n", random_params$max_depth[best_idx]))
```

## Advanced: Bayesian Optimization

Implement iterative Bayesian optimization:

```{r bayesian, eval=FALSE}
# Bayesian optimization would involve:
# 1. Evaluate a small initial set (e.g., 10 combinations)
# 2. Fit a Gaussian process to predict performance
# 3. Use an acquisition function to select the next promising points
# 4. Evaluate new points in parallel
# 5. Repeat until convergence
#
# This requires specialized packages like mlrMBO or rBayesianOptimization,
# but the evaluation step can be parallelized with starburst
```
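A skeleton of how the batch-evaluation step could look is sketched below. The proposal step is a random placeholder standing in for a real surrogate model and acquisition function (that part would come from mlrMBO or a similar package); `propose_candidates`, the round count, and the batch size are illustrative, while `starburst_map` is used exactly as in the grid search above.

```{r bayesian-sketch, eval=FALSE}
# Illustrative skeleton only: random proposals stand in for a real
# Gaussian-process surrogate and acquisition function
propose_candidates <- function(n) {
  data.frame(
    learning_rate    = runif(n, 0.001, 0.3),
    max_depth        = sample(2:10, n, replace = TRUE),
    subsample        = runif(n, 0.5, 1.0),
    min_child_weight = sample(1:10, n, replace = TRUE)
  )
}

history <- list()
for (iter in 1:4) {
  # 1. Propose a batch of candidate parameter sets (placeholder step)
  candidates <- propose_candidates(10)

  # 2. Evaluate the whole batch in parallel, as in the grid search above
  round_results <- starburst_map(
    1:nrow(candidates),
    function(i) cv_evaluate(candidates[i, ], X_train, y_train, n_folds = 5),
    workers = 10, cpu = 2, memory = "4GB"
  )
  history <- c(history, round_results)

  # 3. In real Bayesian optimization, refit the surrogate on `history` here
  best_so_far <- max(sapply(history, function(x) x$mean_cv_score))
  cat(sprintf("Round %d: best CV score so far %.4f\n", iter, best_so_far))
}
```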
## When to Use This Pattern

**Good fit**:

- Expensive model training (> 10 seconds per fit)
- Large parameter spaces (> 20 combinations)
- Cross-validation with multiple folds
- Ensemble model tuning
- Neural architecture search

**Not ideal**:

- Very fast models (< 1 second per fit)
- Small parameter spaces (< 10 combinations)
- Single train/validation split
- Real-time model updates

## Running the Full Example

The complete runnable script is available at:

```{r, eval=FALSE}
system.file("examples/grid-search.R", package = "starburst")
```

Run it with:

```{r, eval=FALSE}
source(system.file("examples/grid-search.R", package = "starburst"))
```

## Next Steps

- Integrate with xgboost, lightgbm, or other ML libraries
- Implement early stopping for faster searches
- Add random search and Bayesian optimization
- Scale to larger datasets and deeper networks
- Use nested cross-validation for model selection

**Related examples**:

- [Feature Engineering](example-feature-engineering.html) - Parallel feature computation
- [Bootstrap CI](example-bootstrap.html) - Model uncertainty estimation
- [Monte Carlo Simulation](example-monte-carlo.html) - Similar parallel pattern