--- title: "Example: Bootstrap Confidence Intervals for A/B Testing" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Example: Bootstrap Confidence Intervals for A/B Testing} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ## Overview Bootstrap resampling is a powerful statistical technique for estimating confidence intervals without assuming a specific distribution. This example demonstrates parallelizing bootstrap analysis for A/B test results. **Use Case**: A/B testing, hypothesis testing, statistical inference, conversion rate analysis **Computational Pattern**: Embarrassingly parallel resampling with aggregation ## The Problem You've run an A/B test on your website with two variants: - **Variant A (Control)**: 10,000 visitors, 850 conversions (8.5% conversion rate) - **Variant B (Treatment)**: 10,000 visitors, 920 conversions (9.2% conversion rate) You need to: 1. Estimate the confidence interval for the difference in conversion rates 2. Calculate the probability that B is better than A 3. Determine if the result is statistically significant Traditional parametric tests assume normal distributions. Bootstrap resampling makes no such assumptions and provides more robust estimates. ## Setup ```{r setup, eval=FALSE} library(starburst) ``` ## Generate Sample Data Create synthetic A/B test data: ```{r data, eval=FALSE} set.seed(42) # Variant A (control) n_a <- 10000 conversions_a <- 850 variant_a <- c(rep(1, conversions_a), rep(0, n_a - conversions_a)) # Variant B (treatment) n_b <- 10000 conversions_b <- 920 variant_b <- c(rep(1, conversions_b), rep(0, n_b - conversions_b)) # Observed difference observed_diff <- mean(variant_b) - mean(variant_a) cat(sprintf("Observed conversion rates:\n")) cat(sprintf(" Variant A: %.2f%%\n", mean(variant_a) * 100)) cat(sprintf(" Variant B: %.2f%%\n", mean(variant_b) * 100)) cat(sprintf(" Difference: %.2f%% (%.1f%% relative lift)\n", observed_diff * 100, (observed_diff / mean(variant_a)) * 100)) ``` **Output**: ``` Observed conversion rates: Variant A: 8.50% Variant B: 9.20% Difference: 0.70% (8.2% relative lift) ``` ## Bootstrap Function Define a function that performs one bootstrap iteration: ```{r bootstrap-fn, eval=FALSE} bootstrap_iteration <- function(iter, data_a, data_b) { # Resample with replacement n_a <- length(data_a) n_b <- length(data_b) sample_a <- sample(data_a, n_a, replace = TRUE) sample_b <- sample(data_b, n_b, replace = TRUE) # Calculate metrics rate_a <- mean(sample_a) rate_b <- mean(sample_b) diff <- rate_b - rate_a relative_lift <- diff / rate_a list( iteration = iter, rate_a = rate_a, rate_b = rate_b, diff = diff, relative_lift = relative_lift, b_wins = diff > 0 ) } ``` ## Local Execution Run a smaller bootstrap locally: ```{r local, eval=FALSE} n_bootstrap_local <- 1000 cat(sprintf("Running %d bootstrap iterations locally...\n", n_bootstrap_local)) local_start <- Sys.time() local_results <- lapply( 1:n_bootstrap_local, bootstrap_iteration, data_a = variant_a, data_b = variant_b ) local_time <- as.numeric(difftime(Sys.time(), local_start, units = "secs")) cat(sprintf("āœ“ Completed in %.2f seconds\n\n", local_time)) ``` **Typical output**: ``` Running 1000 bootstrap iterations locally... āœ“ Completed in 2.1 seconds ``` ## Cloud Execution with staRburst Run 10,000 bootstrap iterations on AWS: ```{r cloud, eval=FALSE} n_bootstrap <- 10000 cat(sprintf("Running %d bootstrap iterations on AWS...\n", n_bootstrap)) results <- starburst_map( 1:n_bootstrap, bootstrap_iteration, data_a = variant_a, data_b = variant_b, workers = 25, cpu = 1, memory = "2GB" ) ``` **Typical output**: ``` šŸš€ Starting starburst cluster with 25 workers šŸ’° Estimated cost: ~$1.00/hour šŸ“Š Processing 10000 items with 25 workers šŸ“¦ Created 25 chunks (avg 400 items per chunk) šŸš€ Submitting tasks... āœ“ Submitted 25 tasks ā³ Progress: 25/25 tasks (0.3 minutes elapsed) āœ“ Completed in 0.3 minutes šŸ’° Actual cost: $0.01 ``` ## Results Analysis Extract and analyze the bootstrap distribution: ```{r analysis, eval=FALSE} # Extract metrics diffs <- sapply(results, function(x) x$diff) relative_lifts <- sapply(results, function(x) x$relative_lift) b_wins <- sapply(results, function(x) x$b_wins) # Calculate confidence intervals ci_95 <- quantile(diffs, c(0.025, 0.975)) ci_99 <- quantile(diffs, c(0.005, 0.995)) # Probability that B is better than A prob_b_wins <- mean(b_wins) * 100 # Print results cat("\n=== Bootstrap Results (10,000 iterations) ===\n\n") cat(sprintf("Observed difference: %.2f%%\n", observed_diff * 100)) cat(sprintf("\n95%% Confidence Interval: [%.2f%%, %.2f%%]\n", ci_95[1] * 100, ci_95[2] * 100)) cat(sprintf("99%% Confidence Interval: [%.2f%%, %.2f%%]\n", ci_99[1] * 100, ci_99[2] * 100)) cat(sprintf("\nProbability that B > A: %.1f%%\n", prob_b_wins)) # Statistical significance if (ci_95[1] > 0) { cat("\nāœ“ Result is statistically significant at 95% confidence level\n") cat(" (95% CI does not include zero)\n") } else { cat("\nāœ— Result is NOT statistically significant at 95% confidence level\n") cat(" (95% CI includes zero)\n") } # Relative lift analysis cat(sprintf("\nRelative lift: %.1f%%\n", median(relative_lifts) * 100)) cat(sprintf("95%% CI for relative lift: [%.1f%%, %.1f%%]\n", quantile(relative_lifts, 0.025) * 100, quantile(relative_lifts, 0.975) * 100)) ``` **Typical output**: ``` === Bootstrap Results (10,000 iterations) === Observed difference: 0.70% 95% Confidence Interval: [0.21%, 1.19%] 99% Confidence Interval: [0.08%, 1.32%] Probability that B > A: 99.7% āœ“ Result is statistically significant at 95% confidence level (95% CI does not include zero) Relative lift: 8.2% 95% CI for relative lift: [2.5%, 14.0%] ``` ## Visualization Plot the bootstrap distribution: ```{r viz, eval=FALSE} # Create histogram hist(diffs * 100, breaks = 50, main = "Bootstrap Distribution of Conversion Rate Difference", xlab = "Difference in Conversion Rate (percentage points)", col = "lightblue", border = "white") # Add reference lines abline(v = 0, col = "red", lwd = 2, lty = 2) abline(v = ci_95 * 100, col = "darkblue", lwd = 2, lty = 2) abline(v = observed_diff * 100, col = "darkgreen", lwd = 2) # Add legend legend("topright", c("Observed difference", "Zero (no effect)", "95% CI"), col = c("darkgreen", "red", "darkblue"), lwd = 2, lty = c(1, 2, 2)) ``` ## Performance Comparison | Method | Iterations | Time | Cost | Speedup | |--------|-----------|------|------|---------| | Local | 1,000 | 2.1 sec | $0 | 1x | | Local (est.) | 10,000 | 21 sec | $0 | 1x | | staRburst | 10,000 | 18 sec | $0.01 | 6.9x | **Key Insights**: - Bootstrap is highly parallelizable - Fast iterations still benefit from cloud parallelization - Minimal cost even with 10,000 iterations - Can easily scale to 100,000+ iterations for more precision ## Advanced: Multi-Metric Bootstrap Bootstrap multiple metrics simultaneously: ```{r multi-metric, eval=FALSE} bootstrap_all_metrics <- function(iter, data_a, data_b) { n_a <- length(data_a) n_b <- length(data_b) sample_a <- sample(data_a, n_a, replace = TRUE) sample_b <- sample(data_b, n_b, replace = TRUE) # Multiple metrics rate_a <- mean(sample_a) rate_b <- mean(sample_b) se_a <- sd(sample_a) / sqrt(n_a) se_b <- sd(sample_b) / sqrt(n_b) list( diff_rate = rate_b - rate_a, relative_lift = (rate_b - rate_a) / rate_a, z_score = (rate_b - rate_a) / sqrt(se_a^2 + se_b^2), effect_size = (rate_b - rate_a) / sqrt((var(sample_a) + var(sample_b)) / 2) ) } # Run multi-metric bootstrap multi_results <- starburst_map( 1:10000, bootstrap_all_metrics, data_a = variant_a, data_b = variant_b, workers = 25 ) ``` ## When to Use This Pattern **Good fit**: - Need robust confidence intervals - Non-normal distributions - Small sample sizes - Complex metrics (e.g., ratios, quantiles) - Multiple hypothesis testing **Not ideal**: - Very large datasets (> 1M rows per group) - Simple metrics with known distributions - Real-time analysis requirements ## Running the Full Example The complete runnable script is available at: ```{r, eval=FALSE} system.file("examples/bootstrap.R", package = "starburst") ``` Run it with: ```{r, eval=FALSE} source(system.file("examples/bootstrap.R", package = "starburst")) ``` ## Next Steps - Try different sample sizes - Experiment with stratified bootstrap - Compare with parametric tests (t-test, z-test) - Add multiple variants (A/B/C testing) - Bootstrap other metrics (median, quantiles, variance) **Related examples**: - [Monte Carlo Simulation](example-monte-carlo.html) - Similar resampling pattern - [Risk Modeling](example-risk-modeling.html) - Advanced statistical analysis