--- title: "Association Analysis for UKB Outcomes" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Association Analysis for UKB Outcomes} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", eval = FALSE ) ``` ## Overview The `assoc_*` functions fit regression models for each exposure variable and return tidy result tables suitable for downstream forest plots and publication tables. | Function | Alias | Model | Effect measure | |---|---|---|---| | `assoc_coxph()` | `assoc_cox()` | Cox proportional hazards | HR | | `assoc_logistic()` | `assoc_logit()` | Logistic regression | OR | | `assoc_linear()` | `assoc_lm()` | Linear regression | beta | | `assoc_coxph_zph()` | `assoc_zph()` | Schoenfeld residual PH test | chisq / p | | `assoc_subgroup()` | `assoc_sub()` | Stratified analysis + LRT interaction | HR / OR / beta | | `assoc_trend()` | `assoc_tr()` | Dose-response trend | HR / OR / beta + p_trend | | `assoc_competing()` | `assoc_fg()` | Fine-Gray competing risks | SHR | | `assoc_lag()` | — | Cox lag sensitivity analysis | HR | > **Prerequisite**: the analysis dataset should already contain derived case > status, follow-up time, and covariates produced by the `derive_*` functions. > See `vignette("derive")` and `vignette("derive-survival")`. ```{r setup-data} library(ukbflow) # ops_toy(scenario = "association") returns a pre-derived analysis-ready table: # covariates already as factors, bmi_cat / tdi_cat binned, and two outcomes # (dm_* and htn_*) with status / date / timing / followup columns in place. 
dt <- ops_toy(scenario = "association") dt <- dt[dm_timing != 1L] # incident analysis: exclude prevalent DM cases ``` --- ## The Three-Model Framework All main functions automatically produce up to three adjustment levels without requiring manual formula construction: | Model | Covariates | When included | |---|---|---| | **Unadjusted** | None (crude) | Always (when `base = TRUE`) | | **Age and sex adjusted** | Age (field 21022) + sex (field 31), auto-detected | When both columns are found | | **Fully adjusted** | User-supplied `covariates` | When `covariates` is non-NULL | Age and sex columns are located automatically by scanning the dataset's column names for standard UKB naming patterns (`p21022_*` for age at recruitment, `p31_*` for sex). As a fallback, any column starting with `"age"` or named `"sex"` is also recognised. No prerequisite pipeline step is required — the detection works on any data.frame whose columns follow UKB or ukbflow naming conventions. Set `base = FALSE` to skip the first two models and run only the Fully adjusted model. --- ## Step 1: Cox Proportional Hazards `assoc_coxph()` is the primary function for time-to-event outcomes. It accepts logical (`TRUE`/`FALSE`) or integer (`0`/`1`) event indicators and returns one row per exposure x term x model combination. 
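Before calling the wrapper, it can help to see the model it wraps: each adjustment level is, conceptually, an ordinary Cox fit. Below is a minimal standalone sketch of one Fully adjusted fit, assuming the `survival` package is installed (the wrapper additionally handles per-model complete-case filtering, the three-model loop, and result tidying; covariate choice here is illustrative only):

```{r coxph-by-hand}
library(survival)

# Hypothetical manual equivalent of a single Fully adjusted fit
fit <- coxph(
  Surv(dm_followup_years, dm_status) ~ p20116_i0 + bmi_cat + tdi_cat,
  data = dt
)
summary(fit)$conf.int  # HR with 95% CI, one row per coefficient
```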
```{r assoc-coxph}
# Crude + age-sex adjusted (automatic); p21022 and p31 auto-detected
res <- assoc_coxph(
  data = dt,
  outcome_col = "dm_status",
  time_col = "dm_followup_years",
  exposure_col = c("p20116_i0", "bmi_cat")
)
```

```{r assoc-coxph-full}
# Add a Fully adjusted model
res <- assoc_coxph(
  data = dt,
  outcome_col = "dm_status",
  time_col = "dm_followup_years",
  exposure_col = "p20116_i0",
  covariates = c("bmi_cat", "tdi_cat", "p1558_i0", paste0("p22009_a", 1:10))
)
```

Output columns:

| Column | Description |
|---|---|
| `exposure` | Exposure variable name |
| `term` | Coefficient name from `coxph` |
| `model` | Ordered factor: Unadjusted < Age and sex adjusted < Fully adjusted |
| `n` | Participants in model (after NA removal) |
| `n_events` | Events in model |
| `person_years` | Total person-years (rounded) |
| `HR` | Hazard ratio |
| `CI_lower`, `CI_upper` | Confidence interval bounds |
| `p_value` | Wald test p-value |
| `HR_label` | Formatted string, e.g. `"1.43 (1.28-1.60)"` |

> `n`, `n_events`, and `person_years` all reflect the model-specific
> complete-case analysis set — participants with any missing value across
> outcome, time, exposure, or covariates are dropped before fitting. These
> counts therefore differ between models (e.g. Fully adjusted typically has
> fewer participants than Unadjusted).

---

## Step 2: Logistic Regression

`assoc_logistic()` is for binary outcomes without a time dimension (e.g.
case-control or cross-sectional designs).
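As with the Cox wrapper, each fit is conceptually a plain `glm()`. A minimal hand-rolled sketch (covariate choice illustrative only; the wrapper adds the three-model loop and tidying):

```{r glm-by-hand}
# Hypothetical manual equivalent of a single logistic fit
fit <- glm(dm_status ~ p20116_i0 + tdi_cat, data = dt, family = binomial())

# Wald CIs on the OR scale; confint(fit) would use profile likelihood instead
exp(cbind(OR = coef(fit), confint.default(fit)))
```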
```{r assoc-logistic} res <- assoc_logistic( data = dt, outcome_col = "dm_status", exposure_col = c("p20116_i0", "bmi_cat"), covariates = c("tdi_cat", "p1558_i0", paste0("p22009_a", 1:10)) ) ``` For sparse data or small samples, use profile likelihood confidence intervals: ```{r assoc-logistic-profile} res <- assoc_logistic( data = dt, outcome_col = "dm_status", exposure_col = "grs_bmi", ci_method = "profile" # slower but more accurate for sparse data ) ``` Output is identical in structure to `assoc_coxph()` but with `OR` and `OR_label` in place of `HR` / `HR_label`, and `n_cases` instead of `n_events` / `person_years`. --- ## Step 3: Linear Regression `assoc_linear()` is for continuous outcomes (e.g. biomarker levels, BMI). The standard error of beta is included to support downstream meta-analysis. ```{r assoc-linear} # Continuous outcome (BMI); smoking and GRS as exposures res <- assoc_linear( data = dt, outcome_col = "p21001_i0", exposure_col = c("p20116_i0", "grs_bmi"), covariates = c("tdi_cat", "p1558_i0", paste0("p22009_a", 1:10)) ) ``` > **Common mistake**: passing a binary (0/1) or logical column as > `outcome_col` is permitted (linear probability model) but triggers a > warning. Use `assoc_logistic()` for binary outcomes — linear regression > on a 0/1 outcome produces unbounded predictions and is rarely appropriate > in epidemiological analyses. --- ## Step 4: Proportional Hazards Assumption Test `assoc_coxph_zph()` re-fits the same Cox models and tests the PH assumption via Schoenfeld residuals (`cox.zph()`). Use it alongside `assoc_coxph()` to validate model assumptions. 
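Conceptually this is `survival::cox.zph()` applied to each fitted model. A minimal sketch for a single model, assuming the `survival` package:

```{r zph-by-hand}
library(survival)

# Schoenfeld residual test on one fitted Cox model
fit <- coxph(
  Surv(dm_followup_years, dm_status) ~ p20116_i0 + bmi_cat,
  data = dt
)
cox.zph(fit)  # per-term chisq / df / p, plus a GLOBAL row
```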
```{r assoc-zph} zph <- assoc_coxph_zph( data = dt, outcome_col = "dm_status", time_col = "dm_followup_years", exposure_col = c("p20116_i0", "bmi_cat"), covariates = c("tdi_cat", "p1558_i0") ) # Identify any violations zph[ph_satisfied == FALSE] ``` Output columns include `chisq`, `df`, `p_value`, `ph_satisfied` (logical), and global test statistics (`global_chisq`, `global_df`, `global_p`). When the PH assumption is violated, consider adding `strata()` to `assoc_coxph()` or modelling time-varying effects. --- ## Step 5: Subgroup Analysis `assoc_subgroup()` stratifies the dataset by a grouping variable and fits the specified model within each subgroup. A likelihood ratio test (LRT) for the exposure x subgroup interaction is computed on the full dataset and appended as `p_interaction`. ```{r assoc-subgroup} # Subgroup by sex; p31 is automatically excluded from within-stratum models res <- assoc_subgroup( data = dt, outcome_col = "dm_status", time_col = "dm_followup_years", exposure_col = "p20116_i0", by = "p31", method = "coxph", covariates = c("p21022", "bmi_cat", "tdi_cat") ) ``` Output columns include `subgroup`, `subgroup_level`, and `p_interaction` in addition to the standard effect estimate columns. > **Note**: the `by` variable is automatically excluded from the subgroup > models. Do not include it in `covariates` — a collinearity warning is > issued if you do. Subgroup analysis is most interpretable when `by` has a > small number of levels (e.g. sex, smoking status); continuous variables are > technically permitted but per-unique-value subgrouping is rarely meaningful > in practice — use a pre-categorised version (e.g. via `derive_cut()`) > instead. --- ## Step 6: Dose-Response Trend Analysis `assoc_trend()` fits categorical and trend models simultaneously for each ordered-factor exposure, returning per-category estimates alongside a p-value for linear trend. 
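A standard way to obtain a linear-trend p-value for an ordered exposure (and a useful mental model for `p_trend`, though the wrapper's exact internals may differ) is to refit with the categories replaced by a single numeric score. A sketch assuming the `survival` package; `bmi_score` is a hypothetical helper column:

```{r trend-by-hand}
library(survival)

# Replace the ordered factor by integer scores 1, 2, 3, ... (hypothetical column)
dt[, bmi_score := as.integer(bmi_cat)]
fit <- coxph(Surv(dm_followup_years, dm_status) ~ bmi_score, data = dt)
summary(fit)$coefficients  # the score term's p-value is the trend test
```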
```{r assoc-trend-prep} # assoc_trend() requires an ordered factor; make bmi_cat ordered in-place dt[, bmi_cat := factor(bmi_cat, levels = levels(bmi_cat), ordered = TRUE)] ``` ```{r assoc-trend} res <- assoc_trend( data = dt, outcome_col = "dm_status", time_col = "dm_followup_years", exposure_col = "bmi_cat", method = "coxph", covariates = c("p21022", "p31", "tdi_cat", "p20116_i0") ) ``` Supply custom scores reflecting approximate median BMI per category: ```{r assoc-trend-scores} res <- assoc_trend( data = dt, outcome_col = "dm_status", time_col = "dm_followup_years", exposure_col = "bmi_cat", method = "coxph", covariates = c("p21022", "p31", "tdi_cat", "p20116_i0"), scores = c(17, 22, 27, 35) # approximate median BMI per category ) ``` The output contains a reference row (`HR = 1.00 (ref)`) followed by non-reference rows, with `HR_per_score`, `HR_per_score_label`, and `p_trend` appended as shared columns. These trend columns carry the same value across every row within the same exposure × model combination — including the reference row — so the full result table remains self-contained and easy to filter or export. --- ## Step 7: Fine-Gray Competing Risks `assoc_competing()` fits a Fine-Gray subdistribution hazard model via `survival::finegray()` + inverse-probability-of-censoring-weighted Cox regression. Use this when a competing event (e.g. death) can preclude the outcome of interest. 
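Under the hood this is the standard two-step `survival` recipe: `finegray()` expands the data with inverse-probability-of-censoring weights, and a weighted `coxph()` on the expanded rows estimates the SHR. A minimal sketch, assuming an event factor `ev` whose first level is censoring (`ev` is hypothetical here; the Mode A example below constructs a comparable integer column):

```{r finegray-by-hand}
library(survival)

# 'ev' is a hypothetical factor: first level = censoring, then the event types
fg <- finegray(Surv(dm_followup_years, ev) ~ ., data = dt, etype = "dm")

# Weighted Cox on the expanded data yields the subdistribution HR (SHR)
coxph(Surv(fgstart, fgstop, fgstatus) ~ p20116_i0, data = fg, weights = fgwt)
```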
Two input modes are supported: **Mode A** — a single column encodes all event types: ```{r assoc-competing-a} # Construct a combined event column: 0 = censored, 1 = DM, 2 = HTN (competing) dt_cr <- dt[htn_timing != 1L] # also exclude prevalent HTN dt_cr[, event_type := data.table::fcase( dm_timing == 2L, 1L, htn_timing == 2L, 2L, default = 0L )] res <- assoc_competing( data = dt_cr, outcome_col = "event_type", # 0 = censored, 1 = DM, 2 = HTN time_col = "dm_followup_years", exposure_col = "p20116_i0", event_val = 1L, compete_val = 2L, covariates = c("bmi_cat", "tdi_cat") ) ``` **Mode B** — separate 0/1 columns for primary and competing events: ```{r assoc-competing-b} res <- assoc_competing( data = dt_cr, outcome_col = "dm_status", # primary event time_col = "dm_followup_years", exposure_col = c("p20116_i0", "bmi_cat"), compete_col = "htn_status", # competing event covariates = c("tdi_cat", "p1558_i0") ) ``` Output uses `SHR` (subdistribution hazard ratio) and `SHR_label` in place of HR, and adds `n_compete` (number of competing events in the analysis set). --- ## Working with Results All functions return a `data.table` that feeds directly into `plot_forest()`: ```{r downstream} res <- assoc_coxph( data = dt, outcome_col = "dm_status", time_col = "dm_followup_years", exposure_col = "p20116_i0", covariates = c("bmi_cat", "tdi_cat", "p1558_i0") ) # Pass directly to plot_forest() plot_forest(as.data.frame(res)) # Filter to a single model res[model == "Fully adjusted"] # Export data.table::fwrite(res, "assoc_results.csv") ``` --- ## Step 8: Lag Sensitivity Analysis `assoc_lag()` re-runs the same Cox models at one or more lag periods to assess whether associations are driven by early events (reverse causation or detection bias). For each lag, participants whose follow-up time is less than `lag_years` are excluded; follow-up is kept on its original scale. 
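A single lag iteration is easy to reproduce by hand, which clarifies what the function loops over: drop participants whose follow-up is shorter than the lag and refit, keeping follow-up time on its original scale. A sketch assuming the `survival` package:

```{r lag-by-hand}
library(survival)

# Lag of 2 years: exclude follow-up shorter than the lag, leave time unchanged
dt_lag <- dt[dm_followup_years >= 2]
coxph(Surv(dm_followup_years, dm_status) ~ p20116_i0, data = dt_lag)
```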
```{r assoc-lag} res <- assoc_lag( data = dt, outcome_col = "dm_status", time_col = "dm_followup_years", exposure_col = "p20116_i0", lag_years = c(0, 1, 2), # 0 = full cohort reference covariates = c("bmi_cat", "tdi_cat", "p1558_i0") ) ``` Setting `lag_years = 0` (or including `0` in the vector) runs the model on the full unfiltered cohort, providing a reference row against which lagged results can be compared. Output adds two columns to the standard `assoc_coxph()` result: | Column | Description | |---|---| | `lag_years` | Lag period applied (numeric, same units as `time_col`) | | `n_excluded` | Participants excluded because follow-up < `lag_years` | --- ## Getting Help - `?assoc_coxph`, `?assoc_logistic`, `?assoc_linear` - `?assoc_coxph_zph`, `?assoc_subgroup`, `?assoc_trend`, `?assoc_competing`, `?assoc_lag` - `vignette("derive-survival")` — follow-up time and event derivation - `vignette("derive")` — disease phenotype derivation - [GitHub Issues](https://github.com/evanbio/ukbflow/issues)