---
title: "Getting Started with achieveGap"
author: "[Your Name], Michigan State University"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Getting Started with achieveGap}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse  = TRUE,
  comment   = "#>",
  fig.width = 7,
  fig.height = 4.5,
  warning   = FALSE,
  message   = FALSE
)
```

## Overview

The **achieveGap** package provides a statistically rigorous framework for
estimating achievement gap *trajectories* in longitudinal educational data.
Rather than computing the gap as a post hoc difference between two
separately estimated curves, the gap function is parameterized directly
as a smooth estimand within a joint hierarchical model. The gap's
uncertainty then comes directly from the joint fit and is summarized with
simultaneous confidence bands.

The package is motivated by the observation that achievement gaps often
evolve nonlinearly across grades — widening rapidly in early elementary
school, plateauing in middle grades, or narrowing following policy
interventions — patterns that standard linear growth models cannot capture.
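
The direct parameterization described above can be illustrated with a small
sketch. It uses `mgcv::gam()` on toy data and leaves out the school and
student random effects, so it is only an approximation of the idea and not
the package's internal model. The data, the `gap_true()` function, and all
object names are assumptions made for this example.

```r
# Illustrative sketch only: a plain mgcv::gam() without random effects,
# not the model fitted inside achieveGap.
library(mgcv)

set.seed(1)
n     <- 2000
grade <- sample(0:7, n, replace = TRUE)
g     <- rbinom(n, 1, 0.5)                 # 1 = focal group, 0 = reference
gap_true <- function(t) -0.2 - 0.05 * t    # hypothetical true gap function
score <- 0.3 * sqrt(grade + 1) + g * gap_true(grade) + rnorm(n, sd = 0.5)
toy   <- data.frame(score, grade, g)

# s(grade) is the reference-group curve; s(grade, by = g) is the gap itself.
# With a numeric 0/1 `by` variable the second smooth is not centered,
# so it carries the gap's level as well as its shape.
m <- gam(score ~ s(grade, k = 6) + s(grade, by = g, k = 6),
         data = toy, method = "REML")

# Rows of the linear predictor matrix for each group on a grade grid
grid <- data.frame(grade = seq(0, 7, length.out = 50))
X1 <- predict(m, transform(grid, g = 1), type = "lpmatrix")
X0 <- predict(m, transform(grid, g = 0), type = "lpmatrix")
Xd <- X1 - X0                              # design rows for the gap only

gap_hat <- drop(Xd %*% coef(m))
gap_se  <- sqrt(rowSums((Xd %*% vcov(m)) * Xd))
head(cbind(grade = grid$grade, gap_hat, gap_se))
```

Because the gap is a term in a single fitted model, its standard error is
computed from one joint covariance matrix rather than pieced together from
two separate fits.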

---

## Installation

```r
# From CRAN (once published):
install.packages("achieveGap")

# From GitHub (development version):
# install.packages("devtools")
devtools::install_github("yourusername/achieveGap")
```

---

## Quickstart: Simulated Data

The simplest way to get started is with `simulate_gap()`, which generates
synthetic longitudinal data with a known true gap function.

```{r simulate}
library(achieveGap)

# Generate data: 400 students in 30 schools, grades K-7
# Gap shape: monotone widening (true gap increases steadily across grades)
sim <- simulate_gap(
  n_students = 400,
  n_schools  = 30,
  gap_shape  = "monotone",
  seed       = 2024
)

# Preview the data
head(sim$data)
```

```{r true-gap-table}
# The true gap at each grade
sim$true_gap
```

The data frame has one row per student-grade observation, with columns
`student`, `grade`, `school`, `SES_group` (0 = high SES, 1 = low SES),
and `score`.

---

## Fitting the Model

The main function is `gap_trajectory()`. You specify the column names for
the outcome, grade, group indicator, school ID, and student ID.

```{r fit, cache = TRUE}
# Column-name interface: pass the data and the name of each required column
fit <- gap_trajectory(
  data    = sim$data,
  score   = "score",
  grade   = "grade",
  group   = "SES_group",
  school  = "school",
  student = "student",
  k       = 6,     # spline basis dimension (must be less than the number of unique grades)
  n_sim   = 5000   # posterior draws for simultaneous bands
)
```

Printing the fitted object gives a concise overview:

```{r print}
print(fit)
```

---

## Summarizing Results

`summary()` displays estimated gap values with standard errors and
simultaneous confidence band bounds at equally spaced grade points.
Grades marked with `*` have simultaneous bands that exclude zero, which
indicates a statistically significant gap after controlling for
multiplicity across grades.

```{r summary}
summary(fit)
```
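
For intuition about how a simultaneous band differs from a pointwise one,
the sketch below computes a simulation-based critical value in base R. It
is a generic illustration of the technique and may differ from what the
package does internally. The estimates `est`, the covariance matrix `V`,
and the grid are made-up toy values.

```r
# Illustrative sketch only; est, V and n_sim are made-up toy values.
set.seed(42)
grid <- seq(0, 7, length.out = 40)
est  <- -0.2 - 0.05 * grid                     # hypothetical gap estimates
# Hypothetical covariance of the estimates: neighbouring grades are correlated
V    <- 0.05^2 * exp(-abs(outer(grid, grid, "-")) / 2)
se   <- sqrt(diag(V))

n_sim <- 5000
# Draw n_sim error curves from N(0, V) using a Cholesky factor (t(R) %*% R = V)
R     <- chol(V)
draws <- matrix(rnorm(n_sim * length(grid)), nrow = n_sim) %*% R

# For each draw, the largest standardized deviation anywhere on the grid
max_dev <- apply(abs(sweep(draws, 2, se, "/")), 1, max)

crit_sim <- unname(quantile(max_dev, 0.95))   # simultaneous critical value
crit_pw  <- qnorm(0.975)                       # pointwise critical value

band <- data.frame(grade = grid,
                   lower = est - crit_sim * se,
                   upper = est + crit_sim * se)
c(pointwise = crit_pw, simultaneous = crit_sim)
```

The simultaneous critical value is larger than the pointwise one because it
has to cover the whole curve at once. This is why the simultaneous band is
wider, and why using more draws gives a more stable estimate of its width.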

---

## Visualizing the Gap Trajectory

`plot()` produces a publication-ready figure. By default, both the
simultaneous band (light shading) and pointwise band (dark shading) are
shown. In simulation settings where the true gap is known, you can overlay
it with the `true_gap` argument.

```{r plot-with-truth, fig.cap = "Estimated gap with both confidence bands and true gap overlaid."}
# Grade labels for x-axis
grade_labs <- c("K", "G1", "G2", "G3", "G4", "G5", "G6", "G7")

# True gap evaluated at the model's grade grid
true_gap_vec <- sim$f1_fun(fit$grade_grid)

plot(
  fit,
  true_gap     = true_gap_vec,
  grade_labels = grade_labs,
  title        = "SES Achievement Gap Trajectory (Simulated Data)"
)
```

To show only the simultaneous band:

```{r plot-sim-only, fig.cap = "Gap trajectory with simultaneous band only."}
plot(fit, band = "simultaneous", grade_labels = grade_labs)
```

---

## Hypothesis Testing

`test_gap()` runs two types of tests:

1. **Global test**: Is the gap smooth significantly different from zero
   anywhere? (Approximate chi-squared test from `mgcv`.)
2. **Simultaneous test**: Which specific grade intervals have a
   statistically significant gap, with joint error rate control?

```{r test}
tryCatch(
  test_gap(fit, type = "both"),
  error = function(e) message("test_gap: ", conditionMessage(e))
)
```
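
The simultaneous test amounts to finding the stretches of the grade grid
where the band excludes zero. The helper below is a hypothetical base R
sketch of that step, not a function exported by the package. The band
values are invented for the example.

```r
# Hypothetical helper, not part of achieveGap: collapse a band into the
# contiguous grade intervals where it excludes zero.
sig_intervals <- function(grid, lower, upper) {
  sig   <- lower > 0 | upper < 0      # TRUE where the band excludes zero
  runs  <- rle(sig)                   # consecutive runs of TRUE/FALSE
  end   <- cumsum(runs$lengths)
  start <- end - runs$lengths + 1
  keep  <- runs$values                # keep only the significant runs
  data.frame(from = grid[start[keep]], to = grid[end[keep]])
}

# Made-up band on grades 0 to 7
grid  <- 0:7
lower <- c(-0.10, -0.05, 0.02, 0.05, 0.04, -0.01, -0.30, -0.40)
upper <- c( 0.20,  0.25, 0.30, 0.35, 0.30,  0.20, -0.05, -0.10)
res <- sig_intervals(grid, lower, upper)
res
```

With these toy values the helper reports two intervals, grades 2 to 4 and
grades 6 to 7. The joint error rate applies to the whole set of intervals
only when the bounds come from a simultaneous band, not a pointwise one.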

---

## Comparing to Separate Splines

`fit_separate()` provides the comparison approach: two independent spline
models, one per group, with the gap computed by post hoc subtraction.
As discussed in the paper, this approach underestimates uncertainty.

```{r separate, cache = TRUE}
sep <- fit_separate(
  data    = sim$data,
  score   = "score",
  grade   = "grade",
  group   = "SES_group",
  school  = "school",
  student = "student"
)

# Compare gap estimates side by side at the grid point nearest grade 4
i_joint <- which.min(abs(fit$grade_grid - 4))
i_sep   <- which.min(abs(sep$grade_grid - 4))
cat("Joint model gap at grade 4:   ", round(fit$gap_hat[i_joint], 3), "\n")
cat("Separate model gap at grade 4:", round(sep$gap_hat[i_sep], 3), "\n")

# Separate model CIs are narrower (anti-conservative)
cat("\nMean CI width - Joint (pointwise):", round(mean(fit$pw_upper - fit$pw_lower), 3))
cat("\nMean CI width - Separate:         ", round(mean(sep$ci_upper - sep$ci_lower), 3), "\n")
```

---

## Non-Monotone Gap Example

The `nonmonotone` gap shape widens early, plateaus, and narrows slightly —
a pattern linear models cannot capture.

```{r nonmono, cache = TRUE, fig.cap = "Non-monotone gap trajectory."}
sim2 <- simulate_gap(n_students = 400, n_schools = 30,
                     gap_shape = "nonmonotone", seed = 99)

fit2 <- gap_trajectory(
  data    = sim2$data,
  score   = "score",
  grade   = "grade",
  group   = "SES_group",
  school  = "school",
  student = "student",
  n_sim   = 1000,
  verbose = FALSE
)

plot(fit2,
     true_gap     = sim2$f1_fun(fit2$grade_grid),
     grade_labels = grade_labs,
     title        = "Non-Monotone SES Gap: Widens Early, Plateaus, Narrows")
```

---

## Running a Benchmark Simulation

`run_simulation()` replicates the simulation study from the paper.
Use `n_reps = 5` for a quick test; the paper used `n_reps = 500`.

```{r sim-study, eval = FALSE}
results <- run_simulation(
  n_reps = 5,    # increase to 500 for paper results
  n_sim  = 1000  # increase to 5000 for final run
)
```

```{r sim-summary, eval = FALSE}
summarize_simulation(results)
```

---

## Using Your Own Data

To apply `gap_trajectory()` to real data, prepare a long-format data frame
with columns for:

- **outcome** (standardized test score, numeric)
- **grade/time** (numeric, e.g., 0 = kindergarten, 1 = grade 1, ...)
- **group** (binary 0/1; 0 = reference group, 1 = focal group)
- **school ID** (factor or integer)
- **student ID** (factor or integer)
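
If the source file is in wide format, with one row per student and one
score column per grade, it has to be reshaped first. The sketch below shows
one way to do this with base R's `reshape()`. The data frame `wide` and its
column names are invented for illustration.

```r
# Hypothetical wide-format input: one row per student, one score column per grade
wide <- data.frame(
  child_id  = 1:4,
  school_id = c(1, 1, 2, 2),
  low_ses   = c(0, 1, 0, 1),
  score_0   = c(-0.1, -0.6, 0.3, -0.4),
  score_1   = c( 0.2, -0.5, 0.5, -0.3),
  score_2   = c( 0.4, -0.6, 0.8, -0.5)
)

# Stack the score columns into one `score` column with a numeric `grade`
long <- reshape(
  wide,
  direction = "long",
  varying   = c("score_0", "score_1", "score_2"),
  v.names   = "score",
  timevar   = "grade",
  times     = 0:2,
  idvar     = "child_id"
)
long <- long[order(long$child_id, long$grade), ]
rownames(long) <- NULL

# Basic checks before fitting: numeric grade, binary 0/1 group indicator
stopifnot(is.numeric(long$grade), all(long$low_ses %in% c(0, 1)))
str(long)
```

The result has one row per student-grade observation, which is the layout
described in the list above.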

```r
# Example with ECLS-K:2011 data (after loading and preparing)
fit_real <- gap_trajectory(
  data    = eclsk_clean,
  score   = "math_score_z",
  grade   = "grade_numeric",
  group   = "low_ses",
  school  = "school_id",
  student = "child_id",
  k       = 6,
  n_sim   = 10000
)

summary(fit_real)
plot(fit_real, grade_labels = c("K", "G1", "G2", "G3", "G4", "G5"))
test_gap(fit_real)
```

---

## Session Info

```{r session}
sessionInfo()
```