--- title: "Association Analysis for UKB Outcomes" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Association Analysis for UKB Outcomes} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", eval = FALSE ) ``` ## Overview The `assoc_*` functions fit regression models for each exposure variable and return tidy result tables suitable for downstream forest plots and publication tables. | Function | Alias | Model | Effect measure | |---|---|---|---| | `assoc_coxph()` | `assoc_cox()` | Cox proportional hazards | HR | | `assoc_logistic()` | `assoc_logit()` | Logistic regression | OR | | `assoc_linear()` | `assoc_lm()` | Linear regression | beta | | `assoc_coxph_zph()` | `assoc_zph()` | Schoenfeld residual PH test | chisq / p | | `assoc_subgroup()` | `assoc_sub()` | Stratified analysis + LRT interaction | HR / OR / beta | | `assoc_trend()` | `assoc_tr()` | Dose-response trend | HR / OR / beta + p_trend | | `assoc_competing()` | `assoc_fg()` | Fine-Gray competing risks | SHR | | `assoc_lag()` | — | Cox lag sensitivity analysis | HR | > **Prerequisite**: the analysis dataset should already contain derived case > status, follow-up time, and covariates produced by the `derive_*` functions. > See `vignette("derive")` and `vignette("derive-survival")`. ```{r setup-data} library(ukbflow) # ops_toy(scenario = "association") returns a pre-derived analysis-ready table: # covariates already as factors, bmi_cat / tdi_cat binned, and two outcomes # (dm_* and htn_*) with status / date / timing / followup columns in place. 
dt <- ops_toy(scenario = "association") dt <- dt[dm_timing != 1L] # incident analysis: exclude prevalent DM cases ``` --- ## The Three-Model Framework All main functions automatically produce up to three adjustment levels without requiring manual formula construction: | Model | Covariates | When included | |---|---|---| | **Unadjusted** | None (crude) | Always (when `base = TRUE`) | | **Age and sex adjusted** | Age (field 21022) + sex (field 31), auto-detected | When both columns are found | | **Fully adjusted** | User-supplied `covariates` | When `covariates` is non-NULL | Age and sex columns are located automatically by scanning the dataset's column names for standard UKB naming patterns (`p21022_*` for age at recruitment, `p31_*` for sex). As a fallback, any column starting with `"age"` or named `"sex"` is also recognised. No prerequisite pipeline step is required — the detection works on any data.frame whose columns follow UKB or ukbflow naming conventions. Set `base = FALSE` to skip the first two models and run only the Fully adjusted model. --- ## Step 1: Cox Proportional Hazards `assoc_coxph()` is the primary function for time-to-event outcomes. It accepts logical (`TRUE`/`FALSE`) or integer (`0`/`1`) event indicators and returns one row per exposure x term x model combination. 
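Before calling the wrapper, it can help to see the model it wraps: each adjustment level is, conceptually, an ordinary Cox fit. Below is a minimal standalone sketch of one Fully adjusted fit, assuming the `survival` package is installed (the wrapper additionally handles per-model complete-case filtering, the three-model loop, and result tidying; covariate choice here is illustrative only):

```{r coxph-by-hand}
library(survival)

# Hypothetical manual equivalent of a single Fully adjusted fit
fit <- coxph(
  Surv(dm_followup_years, dm_status) ~ p20116_i0 + bmi_cat + tdi_cat,
  data = dt
)
summary(fit)$conf.int  # HR with 95% CI, one row per coefficient
```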
```{r assoc-coxph}
# Crude + age-sex adjusted (automatic); p21022 and p31 auto-detected
res <- assoc_coxph(
  data = dt,
  outcome_col = "dm_status",
  time_col = "dm_followup_years",
  exposure_col = c("p20116_i0", "bmi_cat")
)
```

```{r assoc-coxph-full}
# Add a Fully adjusted model
res <- assoc_coxph(
  data = dt,
  outcome_col = "dm_status",
  time_col = "dm_followup_years",
  exposure_col = "p20116_i0",
  covariates = c("bmi_cat", "tdi_cat", "p1558_i0", paste0("p22009_a", 1:10))
)
```

Output columns:

| Column | Description |
|---|---|
| `exposure` | Exposure variable name |
| `term` | Coefficient name from `coxph` |
| `model` | Ordered factor: Unadjusted < Age and sex adjusted < Fully adjusted |
| `n` | Participants in model (after NA removal) |
| `n_events` | Events in model |
| `person_years` | Total person-years (rounded) |
| `HR` | Hazard ratio |
| `CI_lower`, `CI_upper` | Confidence interval bounds |
| `p_value` | Wald test p-value |
| `HR_label` | Formatted string, e.g. `"1.43 (1.28-1.60)"` |

> `n`, `n_events`, and `person_years` all reflect the model-specific
> complete-case analysis set — participants with any missing value across
> outcome, time, exposure, or covariates are dropped before fitting. These
> counts therefore differ between models (e.g. Fully adjusted typically has
> fewer participants than Unadjusted).

---

## Step 2: Logistic Regression

`assoc_logistic()` is for binary outcomes without a time dimension (e.g.
case-control or cross-sectional designs).
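As with the Cox wrapper, each fit is conceptually a plain `glm()`. A minimal hand-rolled sketch (covariate choice illustrative only; the wrapper adds the three-model loop and tidying):

```{r glm-by-hand}
# Hypothetical manual equivalent of a single logistic fit
fit <- glm(dm_status ~ p20116_i0 + tdi_cat, data = dt, family = binomial())

# Wald CIs on the OR scale; confint(fit) would use profile likelihood instead
exp(cbind(OR = coef(fit), confint.default(fit)))
```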
```{r assoc-logistic} res <- assoc_logistic( data = dt, outcome_col = "dm_status", exposure_col = c("p20116_i0", "bmi_cat"), covariates = c("tdi_cat", "p1558_i0", paste0("p22009_a", 1:10)) ) ``` For sparse data or small samples, use profile likelihood confidence intervals: ```{r assoc-logistic-profile} res <- assoc_logistic( data = dt, outcome_col = "dm_status", exposure_col = "grs_bmi", ci_method = "profile" # slower but more accurate for sparse data ) ``` Output is identical in structure to `assoc_coxph()` but with `OR` and `OR_label` in place of `HR` / `HR_label`, and `n_cases` instead of `n_events` / `person_years`. --- ## Step 3: Linear Regression `assoc_linear()` is for continuous outcomes (e.g. biomarker levels, BMI). The standard error of beta is included to support downstream meta-analysis. ```{r assoc-linear} # Continuous outcome (BMI); smoking and GRS as exposures res <- assoc_linear( data = dt, outcome_col = "p21001_i0", exposure_col = c("p20116_i0", "grs_bmi"), covariates = c("tdi_cat", "p1558_i0", paste0("p22009_a", 1:10)) ) ``` > **Common mistake**: passing a binary (0/1) or logical column as > `outcome_col` is permitted (linear probability model) but triggers a > warning. Use `assoc_logistic()` for binary outcomes — linear regression > on a 0/1 outcome produces unbounded predictions and is rarely appropriate > in epidemiological analyses. --- ## Step 4: Proportional Hazards Assumption Test `assoc_coxph_zph()` re-fits the same Cox models and tests the PH assumption via Schoenfeld residuals (`cox.zph()`). Use it alongside `assoc_coxph()` to validate model assumptions. 
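Conceptually this is `survival::cox.zph()` applied to each fitted model. A minimal sketch for a single model, assuming the `survival` package:

```{r zph-by-hand}
library(survival)

# Schoenfeld residual test on one fitted Cox model
fit <- coxph(
  Surv(dm_followup_years, dm_status) ~ p20116_i0 + bmi_cat,
  data = dt
)
cox.zph(fit)  # per-term chisq / df / p, plus a GLOBAL row
```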
```{r assoc-zph} zph <- assoc_coxph_zph( data = dt, outcome_col = "dm_status", time_col = "dm_followup_years", exposure_col = c("p20116_i0", "bmi_cat"), covariates = c("tdi_cat", "p1558_i0") ) # Identify any violations zph[ph_satisfied == FALSE] ``` Output columns include `chisq`, `df`, `p_value`, `ph_satisfied` (logical), and global test statistics (`global_chisq`, `global_df`, `global_p`). When the PH assumption is violated, consider adding `strata()` to `assoc_coxph()` or modelling time-varying effects. --- ## Step 5: Subgroup Analysis `assoc_subgroup()` stratifies the dataset by a grouping variable and fits the specified model within each subgroup. A likelihood ratio test (LRT) for the exposure x subgroup interaction is computed on the full dataset and appended as `p_interaction`. ```{r assoc-subgroup} # Subgroup by sex; p31 is automatically excluded from within-stratum models res <- assoc_subgroup( data = dt, outcome_col = "dm_status", time_col = "dm_followup_years", exposure_col = "p20116_i0", by = "p31", method = "coxph", covariates = c("p21022", "bmi_cat", "tdi_cat") ) ``` Output columns include `subgroup`, `subgroup_level`, and `p_interaction` in addition to the standard effect estimate columns. > **Note**: the `by` variable is automatically excluded from the subgroup > models. Do not include it in `covariates` — a collinearity warning is > issued if you do. Subgroup analysis is most interpretable when `by` has a > small number of levels (e.g. sex, smoking status); continuous variables are > technically permitted but per-unique-value subgrouping is rarely meaningful > in practice — use a pre-categorised version (e.g. via `derive_cut()`) > instead. --- ## Step 6: Dose-Response Trend Analysis `assoc_trend()` fits categorical and trend models simultaneously for each ordered-factor exposure, returning per-category estimates alongside a p-value for linear trend. 
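A standard way to obtain a linear-trend p-value for an ordered exposure (and a useful mental model for `p_trend`, though the wrapper's exact internals may differ) is to refit with the categories replaced by a single numeric score. A sketch assuming the `survival` package; `bmi_score` is a hypothetical helper column:

```{r trend-by-hand}
library(survival)

# Replace the ordered factor by integer scores 1, 2, 3, ... (hypothetical column)
dt[, bmi_score := as.integer(bmi_cat)]
fit <- coxph(Surv(dm_followup_years, dm_status) ~ bmi_score, data = dt)
summary(fit)$coefficients  # the score term's p-value is the trend test
```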
```{r assoc-trend-prep} # assoc_trend() requires an ordered factor; make bmi_cat ordered in-place dt[, bmi_cat := factor(bmi_cat, levels = levels(bmi_cat), ordered = TRUE)] ``` ```{r assoc-trend} res <- assoc_trend( data = dt, outcome_col = "dm_status", time_col = "dm_followup_years", exposure_col = "bmi_cat", method = "coxph", covariates = c("p21022", "p31", "tdi_cat", "p20116_i0") ) ``` Supply custom scores reflecting approximate median BMI per category: ```{r assoc-trend-scores} res <- assoc_trend( data = dt, outcome_col = "dm_status", time_col = "dm_followup_years", exposure_col = "bmi_cat", method = "coxph", covariates = c("p21022", "p31", "tdi_cat", "p20116_i0"), scores = c(17, 22, 27, 35) # approximate median BMI per category ) ``` The output contains a reference row (`HR = 1.00 (ref)`) followed by non-reference rows, with `HR_per_score`, `HR_per_score_label`, and `p_trend` appended as shared columns. These trend columns carry the same value across every row within the same exposure × model combination — including the reference row — so the full result table remains self-contained and easy to filter or export. --- ## Step 7: Fine-Gray Competing Risks `assoc_competing()` fits a Fine-Gray subdistribution hazard model via `survival::finegray()` + inverse-probability-of-censoring-weighted Cox regression. Use this when a competing event (e.g. death) can preclude the outcome of interest. 
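Under the hood this is the standard two-step `survival` recipe: `finegray()` expands the data with inverse-probability-of-censoring weights, and a weighted `coxph()` on the expanded rows estimates the SHR. A minimal sketch, assuming an event factor `ev` whose first level is censoring (`ev` is hypothetical here; the Mode A example below constructs a comparable integer column):

```{r finegray-by-hand}
library(survival)

# 'ev' is a hypothetical factor: first level = censoring, then the event types
fg <- finegray(Surv(dm_followup_years, ev) ~ ., data = dt, etype = "dm")

# Weighted Cox on the expanded data yields the subdistribution HR (SHR)
coxph(Surv(fgstart, fgstop, fgstatus) ~ p20116_i0, data = fg, weights = fgwt)
```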
Two input modes are supported: **Mode A** — a single column encodes all event types: ```{r assoc-competing-a} # Construct a combined event column: 0 = censored, 1 = DM, 2 = HTN (competing) dt_cr <- dt[htn_timing != 1L] # also exclude prevalent HTN dt_cr[, event_type := data.table::fcase( dm_timing == 2L, 1L, htn_timing == 2L, 2L, default = 0L )] res <- assoc_competing( data = dt_cr, outcome_col = "event_type", # 0 = censored, 1 = DM, 2 = HTN time_col = "dm_followup_years", exposure_col = "p20116_i0", event_val = 1L, compete_val = 2L, covariates = c("bmi_cat", "tdi_cat") ) ``` **Mode B** — separate 0/1 columns for primary and competing events: ```{r assoc-competing-b} res <- assoc_competing( data = dt_cr, outcome_col = "dm_status", # primary event time_col = "dm_followup_years", exposure_col = c("p20116_i0", "bmi_cat"), compete_col = "htn_status", # competing event covariates = c("tdi_cat", "p1558_i0") ) ``` Output uses `SHR` (subdistribution hazard ratio) and `SHR_label` in place of HR, and adds `n_compete` (number of competing events in the analysis set). --- ## Working with Results All functions return a `data.table` that feeds directly into `plot_forest()`: ```{r downstream} res <- assoc_coxph( data = dt, outcome_col = "dm_status", time_col = "dm_followup_years", exposure_col = "p20116_i0", covariates = c("bmi_cat", "tdi_cat", "p1558_i0") ) # Pass directly to plot_forest() plot_forest(as.data.frame(res)) # Filter to a single model res[model == "Fully adjusted"] # Export data.table::fwrite(res, "assoc_results.csv") ``` --- ## Step 8: Lag Sensitivity Analysis `assoc_lag()` re-runs the same Cox models at one or more lag periods to assess whether associations are driven by early events (reverse causation or detection bias). For each lag, participants whose follow-up time is less than `lag_years` are excluded; follow-up is kept on its original scale. 
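A single lag iteration is easy to reproduce by hand, which clarifies what the function loops over: drop participants whose follow-up is shorter than the lag and refit, keeping follow-up time on its original scale. A sketch assuming the `survival` package:

```{r lag-by-hand}
library(survival)

# Lag of 2 years: exclude follow-up shorter than the lag, leave time unchanged
dt_lag <- dt[dm_followup_years >= 2]
coxph(Surv(dm_followup_years, dm_status) ~ p20116_i0, data = dt_lag)
```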
```{r assoc-lag} res <- assoc_lag( data = dt, outcome_col = "dm_status", time_col = "dm_followup_years", exposure_col = "p20116_i0", lag_years = c(0, 1, 2), # 0 = full cohort reference covariates = c("bmi_cat", "tdi_cat", "p1558_i0") ) ``` Setting `lag_years = 0` (or including `0` in the vector) runs the model on the full unfiltered cohort, providing a reference row against which lagged results can be compared. Output adds two columns to the standard `assoc_coxph()` result: | Column | Description | |---|---| | `lag_years` | Lag period applied (numeric, same units as `time_col`) | | `n_excluded` | Participants excluded because follow-up < `lag_years` | --- ## Getting Help - `?assoc_coxph`, `?assoc_logistic`, `?assoc_linear` - `?assoc_coxph_zph`, `?assoc_subgroup`, `?assoc_trend`, `?assoc_competing`, `?assoc_lag` - `vignette("derive-survival")` — follow-up time and event derivation - `vignette("derive")` — disease phenotype derivation - [GitHub Issues](https://github.com/evanbio/ukbflow/issues)