--- title: "Verification of sample size calculation via simulation" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Verification of sample size calculation via simulation} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include=FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ```{r, message=FALSE, warning=FALSE} library(gsDesignNB) library(data.table) library(ggplot2) library(gt) ``` ## Introduction This vignette verifies the accuracy of the `sample_size_nbinom()` function by comparing its theoretical predictions for average exposure, statistical information, and power against results from a large-scale simulation study. We specifically test a scenario with: * Piecewise constant accrual rates. * Piecewise exponential dropout (constant in this example). * Negative binomial outcomes. * Fixed follow-up design. ## Simulation design The study sample size suggested by the Friede method is used to evaluate the accuracy of that method. The parameters for both the theoretical calculation and the simulation study are as follows: ### Parameters * **Rates**: $\lambda_1 = 0.4$ (Control), $\lambda_2 = 0.3$ (Experimental). * **Dispersion**: $k = 0.5$. * **Power**: 90%. * **Alpha**: 0.025 (one-sided). * **Dropout**: 10% per year adjusted to monthly rate ($\delta = 0.1 / 12$). * **Trial duration**: 24 months. * **Max follow-up**: 12 months. * **Event gap**: 30 days (approx 0.082 years). * **Accrual**: Piecewise linear ramp-up over 12 months (Rate $R$ for 0-6mo, $2R$ for 6-12mo). ### Theoretical calculation First, we calculate the required sample size and expected properties using `sample_size_nbinom()`. ```{r design} # Parameters lambda1 <- 0.4 lambda2 <- 0.3 dispersion <- 0.5 power <- 0.9 alpha <- 0.025 dropout_rate <- 0.1 / 12 max_followup <- 12 trial_duration <- 24 event_gap <- 20 / 30.42 # 20 days # Accrual targeting 90% power # We provide relative rates (1:2) and the function scales them to achieve power accrual_rate_rel <- c(1, 2) accrual_duration <- c(6, 6) design <- sample_size_nbinom( lambda1 = lambda1, lambda2 = lambda2, dispersion = dispersion, power = power, alpha = alpha, sided = 1, accrual_rate = accrual_rate_rel, accrual_duration = accrual_duration, trial_duration = trial_duration, dropout_rate = dropout_rate, max_followup = max_followup, event_gap = event_gap ) # Extract calculated absolute accrual rates accrual_rate <- design$accrual_rate print(design) ``` ## Simulation results We simulated 3,600 trials using the parameters defined above from the Friede method. This number of simulations was chosen to achieve a standard error for the power estimate of approximately 0.005 when the true power is 90% ($\sqrt{0.9 \times 0.1 / 3600} = 0.005$). The simulation script is located in `data-raw/generate_simulation_data.R`. ```{r load_results} # Load pre-computed simulation results results_file <- system.file("extdata", "simulation_results.rds", package = "gsDesignNB") if (results_file == "" && file.exists("../inst/extdata/simulation_results.rds")) { results_file <- "../inst/extdata/simulation_results.rds" } if (results_file != "") { sim_data <- readRDS(results_file) results <- sim_data$results design_ref <- sim_data$design } else { # Fallback if data is not available (e.g. not installed yet) # This block allows the vignette to build without the data, but warns. warning("Simulation results not found. 
Skipping verification plots.") results <- NULL design_ref <- design } ``` ### Summary of verification results We compare the theoretical predictions from `sample_size_nbinom()` with the observed simulation results across multiple metrics. **Key distinction: Total Exposure vs Exposure at Risk** * **Total Exposure** (`tte_total`): The calendar time a subject is on study, from randomization to the analysis cut date (or censoring). This is the same for both treatment arms by design. * **Exposure at Risk** (`tte`): The time during which a subject can experience a *new* event. After each event, there is an "event gap" period during which new events are not counted (e.g., representing recovery time or treatment effect). This differs by treatment group because the group with more events loses more time to gaps. The theoretical sample size calculation uses exposure at risk internally, but reports both metrics for transparency. ```{r summary_table, eval=!is.null(results)} # ---- Compute all metrics ---- # Number of simulations n_sims <- sum(!is.na(results$estimate)) # Total Exposure (calendar follow-up time) # Note: exposure is the same for both arms in the design (by symmetry) theo_exposure <- design_ref$exposure[1] # Check which column names are available in the results # (Support both old and new naming conventions) has_new_cols <- "exposure_total_control" %in% names(results) if (has_new_cols) { obs_exposure_ctrl <- mean(results$exposure_total_control) obs_exposure_exp <- mean(results$exposure_total_experimental) obs_exposure_at_risk_ctrl <- mean(results$exposure_at_risk_control) obs_exposure_at_risk_exp <- mean(results$exposure_at_risk_experimental) } else { # Legacy: old simulation used 'exposure_control' which was actually at-risk time obs_exposure_ctrl <- NA obs_exposure_exp <- NA obs_exposure_at_risk_ctrl <- mean(results$exposure_control) obs_exposure_at_risk_exp <- mean(results$exposure_experimental) } # Exposure at risk (time at risk excluding event gaps) theo_exposure_at_risk_ctrl <- design_ref$exposure_at_risk_n1 theo_exposure_at_risk_exp <- design_ref$exposure_at_risk_n2 # Events by treatment group theo_events_ctrl <- design_ref$events_n1 theo_events_exp <- design_ref$events_n2 obs_events_ctrl <- mean(results$events_control) obs_events_exp <- mean(results$events_experimental) # Treatment effect true_rr <- lambda2 / lambda1 true_log_rr <- log(true_rr) mean_log_rr <- mean(results$estimate, na.rm = TRUE) # Variance theo_var <- design_ref$variance # Use median of SE^2 for robust estimate median_se_sq <- median(results$se^2, na.rm = TRUE) # Empirical variance of estimates emp_var <- var(results$estimate, na.rm = TRUE) # Power theo_power <- design_ref$power emp_power <- mean(results$p_value < design_ref$inputs$alpha, na.rm = TRUE) # Sample size reproduction z_alpha <- qnorm(1 - design_ref$inputs$alpha) z_beta <- qnorm(design_ref$inputs$power) n_sim_total <- design_ref$n_total n_reproduced <- n_sim_total * (emp_var * (z_alpha + z_beta)^2) / (mean_log_rr^2) # ---- Build summary table ---- summary_df <- data.frame( Metric = c( "Total Exposure (months) - Control", "Total Exposure (months) - Experimental", "Exposure at Risk (months) - Control", "Exposure at Risk (months) - Experimental", "Events per Subject - Control", "Events per Subject - Experimental", "Treatment Effect: log(RR)", "Variance of log(RR)", "Power", "Sample Size" ), Theoretical = c( theo_exposure, theo_exposure, theo_exposure_at_risk_ctrl, theo_exposure_at_risk_exp, theo_events_ctrl / (n_sim_total / 2), theo_events_exp / 
    true_log_rr,
    theo_var,
    theo_power,
    n_sim_total
  ),
  Simulated = c(
    obs_exposure_ctrl,
    obs_exposure_exp,
    obs_exposure_at_risk_ctrl,
    obs_exposure_at_risk_exp,
    obs_events_ctrl / (n_sim_total / 2),
    obs_events_exp / (n_sim_total / 2),
    mean_log_rr,
    median_se_sq,
    emp_power,
    n_reproduced
  ),
  stringsAsFactors = FALSE
)

summary_df$Difference <- summary_df$Simulated - summary_df$Theoretical
summary_df$Rel_Diff_Pct <- 100 * summary_df$Difference / abs(summary_df$Theoretical)

summary_df |>
  gt() |>
  tab_header(
    title = md("**Verification of sample_size_nbinom() Predictions**"),
    subtitle = paste0("Based on ", n_sims, " simulated trials")
  ) |>
  tab_row_group(
    label = md("**Sample Size**"),
    rows = Metric == "Sample Size"
  ) |>
  tab_row_group(
    label = md("**Power**"),
    rows = Metric == "Power"
  ) |>
  tab_row_group(
    label = md("**Variance**"),
    rows = Metric == "Variance of log(RR)"
  ) |>
  tab_row_group(
    label = md("**Treatment Effect**"),
    rows = Metric == "Treatment Effect: log(RR)"
  ) |>
  tab_row_group(
    label = md("**Events**"),
    rows = grepl("Events", Metric)
  ) |>
  tab_row_group(
    label = md("**Exposure**"),
    rows = grepl("Exposure", Metric)
  ) |>
  row_group_order(groups = c(
    "**Exposure**", "**Events**", "**Treatment Effect**",
    "**Variance**", "**Power**", "**Sample Size**"
  )) |>
  fmt_number(columns = c(Theoretical, Simulated, Difference), decimals = 4) |>
  fmt_number(columns = Rel_Diff_Pct, decimals = 2) |>
  cols_label(
    Metric = "",
    Theoretical = "Theoretical",
    Simulated = "Simulated",
    Difference = "Difference",
    Rel_Diff_Pct = "Rel. Diff (%)"
  ) |>
  sub_missing(missing_text = "—")
```

**Notes:**

* **Total Exposure** is the average calendar follow-up time per subject. The theoretical value is the same for both arms (symmetric design), while simulated values may differ slightly due to sampling variability in enrollment and dropout.
* **Exposure at Risk** is the time during which new events can occur, after subtracting the "event gap" periods following each event. The control group has lower exposure at risk because it experiences more events (higher rate), each followed by a gap period.
* **Events per Subject** is the mean number of events per subject in each treatment group.
* **Variance** uses the median of the estimated variances (SE²) from each simulation, which is robust to outliers from unstable model fits.
* **Sample Size** is back-calculated using the standard asymptotic formula with the observed treatment effect and variance.

#### Power confidence interval

A 95% confidence interval for the empirical power confirms that the theoretical power falls within the simulation error bounds.

```{r power_ci, eval=!is.null(results)}
# Denominator counts only simulations with a non-missing p-value, consistent
# with the empirical power computed above.
power_ci <- binom.test(
  sum(results$p_value < design_ref$inputs$alpha, na.rm = TRUE),
  sum(!is.na(results$p_value))
)$conf.int

cat("95% CI for empirical power: [", round(power_ci[1], 4), ", ",
    round(power_ci[2], 4), "]\n", sep = "")
cat("Theoretical power:", round(theo_power, 4), "\n")
```

### Distribution of the test statistic

Here we plot the density of the Wald Z-scores from the simulation and compare it to the expected normal distribution centered at the theoretical mean Z-score.
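For reference, the normal approximation underlying both the sample size calculation and the plot that follows can be summarized as: under the alternative, the Wald statistic is approximately $N(\mathrm{E}[Z], 1)$ with

$$
\mathrm{E}[Z] = \frac{\log(\lambda_2 / \lambda_1)}{\sqrt{V}}, \qquad
\text{Power} = \Phi\left(\Phi^{-1}(\alpha) - \mathrm{E}[Z]\right),
$$

where $V$ is the theoretical variance of $\log(\text{RR})$ and the one-sided test rejects when $Z < \Phi^{-1}(\alpha)$ (the lower tail, since the alternative is $\text{RR} < 1$).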
```{r z_score_dist, eval=!is.null(results)}
# Calculate Z-scores using estimated variance (Wald statistic)
# Z = (hat(delta) - 0) / SE_est
z_scores <- results$estimate / results$se

# Theoretical mean Z-score under the alternative
# E[Z] = log(RR) / sqrt(V_theo)
theo_se <- sqrt(theo_var)
theo_mean_z <- log(lambda2 / lambda1) / theo_se

# Critical value for 1-sided alpha (since we are looking at lower tail for RR < 1)
# However, the Z-scores here are negative (log(0.3/0.4) < 0).
# Rejection region is Z < qnorm(alpha)
crit_val <- qnorm(design_ref$inputs$alpha)

# Proportion of simulations rejecting the null
prop_reject <- mean(z_scores < crit_val, na.rm = TRUE)

# Plot
ggplot(data.frame(z = z_scores), aes(x = z)) +
  geom_density(aes(color = "Simulated Density"), linewidth = 1) +
  stat_function(
    fun = dnorm,
    args = list(mean = theo_mean_z, sd = 1),
    aes(color = "Theoretical Normal"),
    linewidth = 1,
    linetype = "dashed"
  ) +
  geom_vline(xintercept = crit_val, linetype = "dotted", color = "black") +
  annotate(
    "text",
    x = crit_val,
    y = 0.05,
    label = paste0(" Reject: ", round(prop_reject * 100, 1), "%"),
    hjust = 0,
    vjust = 0
  ) +
  labs(
    title = "Distribution of Wald Z-scores",
    subtitle = paste("Theoretical Mean Z =", round(theo_mean_z, 3)),
    x = "Z-score (Estimate / Estimated SE)",
    y = "Density",
    color = "Distribution"
  ) +
  theme_minimal() +
  theme(legend.position = "top")
```

Given the apparent location difference in the plot above between the simulated and theoretical normal distributions, we also examine the theoretical versus simulated mean and standard deviation of $\log(\text{RR})$.

```{r log_rr_stats, eval=!is.null(results)}
# Theoretical mean and SD of log(RR)
# Let's also add median, skewness, and kurtosis
theo_mean_log_rr <- log(lambda2 / lambda1)
theo_sd_log_rr <- sqrt(theo_var)

emp_mean_log_rr <- mean(results$estimate, na.rm = TRUE)
emp_sd_log_rr <- sd(results$estimate, na.rm = TRUE)
emp_median_log_rr <- median(results$estimate, na.rm = TRUE)

# Skewness and Kurtosis (using trimmed data to reduce outlier influence)
# Trim extreme 1% on each tail for robust estimates
ests <- results$estimate[!is.na(results$estimate)]
q_low <- quantile(ests, 0.01)
q_high <- quantile(ests, 0.99)
ests_trimmed <- ests[ests >= q_low & ests <= q_high]
n_trimmed <- length(ests_trimmed)

m3 <- sum((ests_trimmed - mean(ests_trimmed))^3) / n_trimmed
m4 <- sum((ests_trimmed - mean(ests_trimmed))^4) / n_trimmed
s2 <- sum((ests_trimmed - mean(ests_trimmed))^2) / n_trimmed
emp_skew_log_rr <- m3 / s2^(3/2)
emp_kurt_log_rr <- m4 / s2^2

comparison_log_rr <- data.frame(
  Metric = c("Mean", "SD", "Median", "Skewness (trimmed)", "Kurtosis (trimmed)"),
  Theoretical = c(theo_mean_log_rr, theo_sd_log_rr, theo_mean_log_rr, 0, 3),
  Simulated = c(emp_mean_log_rr, emp_sd_log_rr, emp_median_log_rr, emp_skew_log_rr, emp_kurt_log_rr),
  Difference = c(
    emp_mean_log_rr - theo_mean_log_rr,
    emp_sd_log_rr - theo_sd_log_rr,
    emp_median_log_rr - theo_mean_log_rr,
    emp_skew_log_rr - 0,
    emp_kurt_log_rr - 3
  )
)

# Display table
comparison_log_rr |>
  gt() |>
  tab_header(title = md("**Comparison of log(RR) Statistics**")) |>
  fmt_number(columns = where(is.numeric), decimals = 4)
```

## Conclusion

The simulation results confirm that `sample_size_nbinom()` reasonably predicts average exposure, variance (information), and power for this complex design with piecewise accrual and dropout.
However, given the slight underpowering suggested by the simulation study, it may be useful to consider a larger sample size than the Friede approximation provides, with power verified by simulation, as sketched below.
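As a minimal sketch of that recommendation (the 91% target below is an arbitrary illustrative buffer, not a package recommendation), one could rerun the calculation with a slightly higher power target and then re-verify the resulting design with the simulation script in `data-raw/generate_simulation_data.R`:

```{r power_buffer, eval=FALSE}
# Sketch only: add a small buffer to the power target (here 91%, an arbitrary
# choice) and recompute the design with the same assumptions as above.
design_buffered <- sample_size_nbinom(
  lambda1 = lambda1,
  lambda2 = lambda2,
  dispersion = dispersion,
  power = 0.91, # buffer above the nominal 90% target
  alpha = alpha,
  sided = 1,
  accrual_rate = accrual_rate_rel,
  accrual_duration = accrual_duration,
  trial_duration = trial_duration,
  dropout_rate = dropout_rate,
  max_followup = max_followup,
  event_gap = event_gap
)

# Compare the buffered total sample size with the original design
c(original = design$n_total, buffered = design_buffered$n_total)
```

The buffered design would then be checked against a fresh simulation run before being adopted.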