---
title: "Quick Start"
author: "Gilles Colling"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Quick Start}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 7,
  fig.height = 5
)
library(BORG)
```

# Why Your Test Accuracy Might Be Wrong

A model shows 95% accuracy on test data, then drops to 60% in production. The usual culprit: data leakage.

Leakage happens when information from your test set contaminates training. Common causes:

- Preprocessing (scaling, PCA) fitted on all data before splitting
- Features derived from the outcome variable
- The same patient/site appearing in both train and test
- Random CV on spatially autocorrelated data

BORG checks for these problems before you compute metrics.

## Basic Usage

```{r basic-usage}
# Create sample data
set.seed(42)
data <- data.frame(
  x1 = rnorm(100),
  x2 = rnorm(100),
  y = rnorm(100)
)

# Define a split
train_idx <- 1:70
test_idx <- 71:100

# Inspect the split
result <- borg_inspect(data, train_idx = train_idx, test_idx = test_idx)
result
```

No violations detected. But what if we made a mistake?

```{r overlap-detection}
# Accidental overlap in indices
bad_result <- borg_inspect(data, train_idx = 1:60, test_idx = 51:100)
bad_result
```

BORG caught the overlap immediately.

## The Main Entry Point: `borg()`

For most workflows, `borg()` is all you need.
It handles two modes:

### Mode 1: Diagnose Data Dependencies

When you have structured data (spatial coordinates, a time column, or groups), BORG diagnoses dependencies and generates appropriate CV folds:

```{r diagnosis-mode}
# Spatial data with coordinates
set.seed(42)
spatial_data <- data.frame(
  lon = runif(200, -10, 10),
  lat = runif(200, -10, 10),
  elevation = rnorm(200, 500, 100),
  response = rnorm(200)
)

# Let BORG diagnose and create CV folds
result <- borg(spatial_data, coords = c("lon", "lat"), target = "response")
result
```

BORG detected spatial structure and recommended spatial block CV instead of random CV.

### Mode 2: Validate Existing Splits

When you have your own train/test indices, BORG validates them:

```{r validation-mode}
# Validate a manual split
risk <- borg(spatial_data, train_idx = 1:150, test_idx = 151:200)
risk
```

## Visualizing Results

Use the standard R generics `plot()` and `summary()`:

```{r plot-results, fig.width=7, fig.height=5}
# Plot the risk assessment
plot(risk)
```

```{r summary-results}
# Generate methods text for publications
summary(result)
```

## Data Dependency Types

BORG handles three types of data dependencies:

### Spatial Autocorrelation

Points close together tend to have similar values. Random CV underestimates error because train and test points are intermixed.

```{r spatial-example}
result_spatial <- borg(spatial_data, coords = c("lon", "lat"), target = "response")
result_spatial$diagnosis@recommended_cv
```

### Temporal Autocorrelation

Sequential observations are correlated. Future data must not leak into past predictions.

```{r temporal-example}
temporal_data <- data.frame(
  date = seq(as.Date("2020-01-01"), by = "day", length.out = 200),
  value = cumsum(rnorm(200))
)

result_temporal <- borg(temporal_data, time = "date", target = "value")
result_temporal$diagnosis@recommended_cv
```

### Clustered/Grouped Data

Observations within groups (patients, sites, species) are more similar than observations between groups.
```{r grouped-example}
grouped_data <- data.frame(
  site = rep(1:20, each = 10),
  measurement = rnorm(200)
)

result_grouped <- borg(grouped_data, groups = "site", target = "measurement")
result_grouped$diagnosis@recommended_cv
```

## Risk Categories

BORG classifies risks into two categories:

### Hard Violations (Evaluation Invalid)

These invalidate your results completely:

| Risk | Description |
|------|-------------|
| `index_overlap` | Same row in both train and test |
| `duplicate_rows` | Identical observations in train and test |
| `target_leakage` | Feature with \|r\| > 0.99 with the target |
| `group_leakage` | Same group in train and test |
| `temporal_leakage` | Test data predates training data |
| `preprocessing_leakage` | Scaler/PCA fitted on the full data |

### Soft Inflation (Results Biased)

These inflate metrics but don't completely invalidate your results:

| Risk | Description |
|------|-------------|
| `proxy_leakage` | Feature with \|r\| between 0.95 and 0.99 with the target |
| `spatial_proximity` | Test points too close to train points |
| `random_cv_inflation` | Random CV on dependent data |

## Detecting Specific Leakage Types

### Target Leakage

Features derived from the outcome:

```{r target-leakage}
# Simulate target leakage
leaky_data <- data.frame(
  x = rnorm(100),
  leaked_feature = rnorm(100),  # Will be made leaky
  outcome = rnorm(100)
)

# Make leaked_feature highly correlated with outcome
leaky_data$leaked_feature <- leaky_data$outcome + rnorm(100, sd = 0.05)

result <- borg_inspect(leaky_data, train_idx = 1:70, test_idx = 71:100,
                       target = "outcome")
result
```

### Group Leakage

The same entity appearing in both train and test:

```{r group-leakage}
# Simulate clinical data with patient IDs
clinical_data <- data.frame(
  patient_id = rep(1:10, each = 10),
  visit = rep(1:10, times = 10),
  measurement = rnorm(100)
)

# Random split ignoring patients (BAD)
set.seed(123)
all_idx <- sample(100)
train_idx <- all_idx[1:70]
test_idx <- all_idx[71:100]

# Check for group leakage
result <- borg_inspect(clinical_data, train_idx = train_idx,
                       test_idx = test_idx, groups = "patient_id")
result
```

## Working with CV Folds

Access the generated folds directly:

```{r cv-folds}
result <- borg(spatial_data, coords = c("lon", "lat"), target = "response", v = 5)

# Number of folds
length(result$folds)

# First fold's train/test sizes
cat("Fold 1 - Train:", length(result$folds[[1]]$train),
    "Test:", length(result$folds[[1]]$test), "\n")
```

## Exporting Results

For reproducibility, export validation certificates:

```{r certificate}
# Create a certificate
cert <- borg_certificate(result$diagnosis, data = spatial_data)
cert
```

```{r export, eval=FALSE}
# Export to file
borg_export(result$diagnosis, spatial_data, "validation.yaml")
borg_export(result$diagnosis, spatial_data, "validation.json")
```

## Writing Methods Sections

`summary()` generates publication-ready methods paragraphs that include the statistical tests BORG ran, the dependency type detected, and the CV strategy chosen. Three citation styles are supported:

```{r methods-text}
# Default APA style
result <- borg(spatial_data, coords = c("lon", "lat"), target = "response")
methods_text <- summary(result)
```

```{r methods-nature, eval=FALSE}
# Nature style
summary(result, style = "nature")

# Ecology style
summary(result, style = "ecology")
```

The returned text is a character string you can paste directly into a manuscript.
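For example, the string can be written straight to a file and kept with your analysis outputs (a minimal sketch using base R's `writeLines()`; the filename is arbitrary):

```{r save-methods-text, eval=FALSE}
# Save the generated methods paragraph as plain text
writeLines(methods_text, "methods-paragraph.txt")
```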
If you also ran `borg_compare_cv()`, pass the comparison object to include empirical inflation estimates:

```{r methods-with-comparison, eval=FALSE}
comparison <- borg_compare_cv(spatial_data, response ~ lon + lat,
                              coords = c("lon", "lat"))
summary(result, comparison = comparison)
```

## Empirical CV Comparison

When reviewers ask "does it really matter?", `borg_compare_cv()` runs both random and blocked CV on the same data and model, then tests whether the difference is statistically significant:

```{r compare-cv}
comparison <- borg_compare_cv(
  spatial_data,
  formula = response ~ lon + lat,
  coords = c("lon", "lat"),
  v = 5,
  repeats = 5  # Use more repeats in practice
)
print(comparison)
```

```{r compare-cv-plot, fig.width=7, fig.height=5}
plot(comparison)
```

## Power Analysis After Blocking

Switching from random to blocked CV reduces the effective sample size. Before committing to blocked CV, check whether your dataset is large enough:

```{r power-analysis}
# Clustered data: 20 sites, 10 observations each
clustered_data <- data.frame(
  site = rep(1:20, each = 10),
  value = rep(rnorm(20, sd = 2), each = 10) + rnorm(200, sd = 0.5)
)

pw <- borg_power(clustered_data, groups = "site", target = "value")
print(pw)
summary(pw)
```

## Interface Summary

| Function | Purpose |
|----------|---------|
| `borg()` | Main entry point: diagnose data or validate splits |
| `borg_inspect()` | Detailed inspection of a train/test split |
| `borg_diagnose()` | Analyze data dependencies only |
| `borg_compare_cv()` | Empirical random vs. blocked CV comparison |
| `borg_power()` | Power analysis after blocking |
| `plot()` | Visualize results |
| `summary()` | Generate methods text for papers |
| `borg_certificate()` | Create a validation certificate |
| `borg_export()` | Export a certificate to YAML/JSON |

## See Also

- `vignette("risk-taxonomy")` - Complete catalog of detectable risks
- `vignette("frameworks")` - Integration with caret, tidymodels, mlr3