---
title: "Quick Start"
author: "Gilles Colling"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Quick Start}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 7,
  fig.height = 5
)
library(BORG)
```

# Why Your Test Accuracy Might Be Wrong

A model shows 95% accuracy on test data, then drops to 60% in production. The usual culprit: data leakage.

Leakage happens when information from your test set contaminates training. Common causes:

- Preprocessing (scaling, PCA) fitted on all data before splitting
- Features derived from the outcome variable
- The same patient/site appearing in both train and test
- Random CV on spatially autocorrelated data

BORG checks for these problems before you compute metrics.

## Basic Usage

```{r basic-usage}
# Create sample data
set.seed(42)
data <- data.frame(
  x1 = rnorm(100),
  x2 = rnorm(100),
  y = rnorm(100)
)

# Define a split
train_idx <- 1:70
test_idx <- 71:100

# Inspect the split
result <- borg_inspect(data, train_idx = train_idx, test_idx = test_idx)
result
```

No violations detected. But what if we made a mistake?

```{r overlap-detection}
# Accidental overlap in indices
bad_result <- borg_inspect(data, train_idx = 1:60, test_idx = 51:100)
bad_result
```

BORG caught the overlap immediately.

## The Main Entry Point: `borg()`

For most workflows, `borg()` is all you need.
It handles two modes:

### Mode 1: Diagnose Data Dependencies

When you have structured data (spatial coordinates, a time column, or groups), BORG diagnoses dependencies and generates appropriate CV folds:

```{r diagnosis-mode}
# Spatial data with coordinates
set.seed(42)
spatial_data <- data.frame(
  lon = runif(200, -10, 10),
  lat = runif(200, -10, 10),
  elevation = rnorm(200, 500, 100),
  response = rnorm(200)
)

# Let BORG diagnose and create CV folds
result <- borg(spatial_data, coords = c("lon", "lat"), target = "response")
result
```

BORG detected spatial structure and recommended spatial block CV instead of random CV.

### Mode 2: Validate Existing Splits

When you have your own train/test indices, BORG validates them:

```{r validation-mode}
# Validate a manual split
risk <- borg(spatial_data, train_idx = 1:150, test_idx = 151:200)
risk
```

## Visualizing Results

Use the standard R generics `plot()` and `summary()`:

```{r plot-results, fig.width=7, fig.height=5}
# Plot the risk assessment
plot(risk)
```

```{r summary-results}
# Generate methods text for publications
summary(result)
```

## Data Dependency Types

BORG handles three types of data dependencies:

### Spatial Autocorrelation

Points close together tend to have similar values. Random CV underestimates error because train and test points are intermixed.

```{r spatial-example}
result_spatial <- borg(spatial_data, coords = c("lon", "lat"), target = "response")
result_spatial$diagnosis@recommended_cv
```

### Temporal Autocorrelation

Sequential observations are correlated. Future data must not leak into past predictions.

```{r temporal-example}
temporal_data <- data.frame(
  date = seq(as.Date("2020-01-01"), by = "day", length.out = 200),
  value = cumsum(rnorm(200))
)

result_temporal <- borg(temporal_data, time = "date", target = "value")
result_temporal$diagnosis@recommended_cv
```

### Clustered/Grouped Data

Observations within groups (patients, sites, species) are more similar than observations between groups.
```{r grouped-example}
grouped_data <- data.frame(
  site = rep(1:20, each = 10),
  measurement = rnorm(200)
)

result_grouped <- borg(grouped_data, groups = "site", target = "measurement")
result_grouped$diagnosis@recommended_cv
```

## Risk Categories

BORG classifies risks into two categories:

### Hard Violations (Evaluation Invalid)

These invalidate your results completely:

| Risk | Description |
|------|-------------|
| `index_overlap` | Same row in both train and test |
| `duplicate_rows` | Identical observations in train and test |
| `target_leakage` | Feature with \|r\| > 0.99 with the target |
| `group_leakage` | Same group in train and test |
| `temporal_leakage` | Test data predates training data |
| `preprocessing_leakage` | Scaler/PCA fitted on the full data |

### Soft Inflation (Results Biased)

These inflate metrics but don't completely invalidate your results:

| Risk | Description |
|------|-------------|
| `proxy_leakage` | Feature with \|r\| between 0.95 and 0.99 with the target |
| `spatial_proximity` | Test points too close to train points |
| `random_cv_inflation` | Random CV on dependent data |

## Detecting Specific Leakage Types

### Target Leakage

Features derived from the outcome:

```{r target-leakage}
# Simulate target leakage
leaky_data <- data.frame(
  x = rnorm(100),
  leaked_feature = rnorm(100),  # Will be made leaky
  outcome = rnorm(100)
)

# Make leaked_feature highly correlated with outcome
leaky_data$leaked_feature <- leaky_data$outcome + rnorm(100, sd = 0.05)

result <- borg_inspect(leaky_data, train_idx = 1:70, test_idx = 71:100,
                       target = "outcome")
result
```

### Group Leakage

The same entity appearing in both train and test:

```{r group-leakage}
# Simulate clinical data with patient IDs
clinical_data <- data.frame(
  patient_id = rep(1:10, each = 10),
  visit = rep(1:10, times = 10),
  measurement = rnorm(100)
)

# Random split ignoring patients (BAD)
set.seed(123)
all_idx <- sample(100)
train_idx <- all_idx[1:70]
test_idx <- all_idx[71:100]

# Check for group leakage
result <- borg_inspect(clinical_data, train_idx = train_idx,
                       test_idx = test_idx, groups = "patient_id")
result
```

## Working with CV Folds

Access the generated folds directly:

```{r cv-folds}
result <- borg(spatial_data, coords = c("lon", "lat"), target = "response", v = 5)

# Number of folds
length(result$folds)

# First fold's train/test sizes
cat("Fold 1 - Train:", length(result$folds[[1]]$train),
    "Test:", length(result$folds[[1]]$test), "\n")
```

## Exporting Results

For reproducibility, export validation certificates:

```{r certificate}
# Create a certificate
cert <- borg_certificate(result$diagnosis, data = spatial_data)
cert
```

```{r export, eval=FALSE}
# Export to file
borg_export(result$diagnosis, spatial_data, "validation.yaml")
borg_export(result$diagnosis, spatial_data, "validation.json")
```

## Writing Methods Sections

`summary()` generates publication-ready methods paragraphs that include the statistical tests BORG ran, the dependency type detected, and the CV strategy chosen. Three citation styles are supported:

```{r methods-text}
# Default APA style
result <- borg(spatial_data, coords = c("lon", "lat"), target = "response")
methods_text <- summary(result)
```

```{r methods-nature, eval=FALSE}
# Nature style
summary(result, style = "nature")

# Ecology style
summary(result, style = "ecology")
```

The returned text is a character string you can paste directly into a manuscript.
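For example, the string can be written straight to a file and kept with your analysis outputs (a minimal sketch using base R's `writeLines()`; the filename is arbitrary):

```{r save-methods-text, eval=FALSE}
# Save the generated methods paragraph as plain text
writeLines(methods_text, "methods-paragraph.txt")
```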
If you also ran `borg_compare_cv()`, pass the comparison object to include empirical inflation estimates:

```{r methods-with-comparison, eval=FALSE}
comparison <- borg_compare_cv(spatial_data, response ~ lon + lat,
                              coords = c("lon", "lat"))
summary(result, comparison = comparison)
```

## Empirical CV Comparison

When reviewers ask "does it really matter?", `borg_compare_cv()` runs both random and blocked CV on the same data and model, then tests whether the difference is statistically significant:

```{r compare-cv}
comparison <- borg_compare_cv(
  spatial_data,
  formula = response ~ lon + lat,
  coords = c("lon", "lat"),
  v = 5,
  repeats = 5  # Use more repeats in practice
)
print(comparison)
```

```{r compare-cv-plot, fig.width=7, fig.height=5}
plot(comparison)
```

## Power Analysis After Blocking

Switching from random to blocked CV reduces the effective sample size. Before committing to blocked CV, check whether your dataset is large enough:

```{r power-analysis}
# Clustered data: 20 sites, 10 observations each
clustered_data <- data.frame(
  site = rep(1:20, each = 10),
  value = rep(rnorm(20, sd = 2), each = 10) + rnorm(200, sd = 0.5)
)

pw <- borg_power(clustered_data, groups = "site", target = "value")
print(pw)
summary(pw)
```

## Interface Summary

| Function | Purpose |
|----------|---------|
| `borg()` | Main entry point: diagnose data or validate splits |
| `borg_inspect()` | Detailed inspection of a train/test split |
| `borg_diagnose()` | Analyze data dependencies only |
| `borg_compare_cv()` | Empirical random vs. blocked CV comparison |
| `borg_power()` | Power analysis after blocking |
| `plot()` | Visualize results |
| `summary()` | Generate methods text for papers |
| `borg_certificate()` | Create a validation certificate |
| `borg_export()` | Export a certificate to YAML/JSON |

## See Also

- `vignette("risk-taxonomy")` - Complete catalog of detectable risks
- `vignette("frameworks")` - Integration with caret, tidymodels, mlr3