---
title: "Survival Analysis Setup for UKB Outcomes"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Survival Analysis Setup for UKB Outcomes}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  eval = FALSE
)
```

## Overview

After disease case definitions have been derived (see `vignette("derive")`),
three additional functions prepare the data for time-to-event analysis:

| Function | Output columns | Purpose |
|---|---|---|
| `derive_timing()` | `{name}_timing` | Classify prevalent vs. incident disease |
| `derive_age()` | `age_at_{name}` | Age at event (years) |
| `derive_followup()` | `{name}_followup_end`, `{name}_followup_years` | Follow-up end date and duration |

> **Prerequisite**: `{name}_status` and `{name}_date` must already be present,
> as produced by `vignette("derive")`. The examples below assume the full
> disease derivation pipeline has been run on an `ops_toy()` dataset, so
> the baseline date column is `p53_i0` and age at recruitment is `p21022`.

---

## Step 1: Classify Timing (Prevalent vs. Incident)

`derive_timing()` compares the disease date to the UKB baseline assessment
date and assigns each participant to one of four categories:

| Value | Meaning |
|---|---|
| `0` | No disease (`status` is `FALSE`) |
| `1` | **Prevalent**: disease date on or before baseline |
| `2` | **Incident**: disease date strictly after baseline |
| `NA` | Case with no recorded date; timing cannot be determined |

```{r setup-data}
library(ukbflow)

# Build on the derive pipeline from vignette("derive")
df <- ops_toy(n = 500)
df <- derive_missing(df)
df <- derive_covariate(df, as_factor = c("p31", "p20116_i0"))
df <- derive_selfreport(df, name = "dm", regex = "type 2 diabetes")
df <- derive_icd10(df, name = "dm", icd10 = "E11", source = c("hes", "death"))
df <- derive_case(df, name = "dm")
```

```{r derive-timing}
# Uses {name}_status and {name}_date by default
df <- derive_timing(df, name = "dm", baseline_col = "p53_i0")
```

Supply explicit column names when the defaults do not apply:

```{r derive-timing-explicit}
df <- derive_timing(df,
  name = "dm",
  status_col = "dm_status",
  date_col = "dm_date",
  baseline_col = "p53_i0"
)
```

Call once per variable needed: for example, once for the combined case and
once per individual source (HES, self-report, etc.).

---

## Step 2: Age at Event

`derive_age()` computes age at the time of the event for cases, and returns
`NA` for non-cases and cases without a date.

$$\text{age\_at\_event} = \text{age\_at\_recruitment} + \frac{\text{event\_date} - \text{baseline\_date}}{365.25}$$

The divisor 365.25 accounts for leap years, ensuring sub-monthly precision in
age calculation across the full UKB follow-up window.

```{r derive-age}
# Auto-detects {name}_date and {name}_status; produces age_at_{name} column.
df <- derive_age(df,
  name = "dm",
  baseline_col = "p53_i0",
  age_col = "p21022"
)
```

Supply explicit column mappings when names do not follow the default
`{name}_date` / `{name}_status` pattern:

```{r derive-age-explicit}
df <- derive_age(df,
  name = "dm",
  baseline_col = "p53_i0",
  age_col = "p21022",
  date_cols = c(dm = "dm_date"),
  status_cols = c(dm = "dm_status")
)
```

---

## Step 3: Follow-Up Time

`derive_followup()` computes the follow-up end date as the **earliest** of:

1. The outcome event date (if the participant is a case)
2. Date of death (field 40000; competing event)
3. Date lost to follow-up (field 191)
4. The administrative censoring date

Follow-up time in years is then derived from the baseline date.

```{r derive-followup}
df <- derive_followup(df,
  name = "dm",
  event_col = "dm_date",
  baseline_col = "p53_i0",
  censor_date = as.Date("2022-10-31"), # set to your study's cut-off date
  death_col = "p40000_i0",
  lost_col = FALSE # not available in ops_toy
)
```

Output columns:

| Column | Type | Description |
|---|---|---|
| `dm_followup_end` | IDate | Earliest competing date |
| `dm_followup_years` | numeric | Years from baseline to end |

### Prevalent cases receive `NA` follow-up time

Participants whose event date falls **before or on the baseline date**
(prevalent cases, `{name}_timing == 1`) will have `followup_years` set to `NA`
rather than a zero or negative value, which has no meaning in time-to-event
analysis. Use `derive_timing()` to identify and exclude prevalent cases before
fitting a Cox model (see the full pipeline example below).

### Auto-detection of death and lost-to-follow-up columns

When `death_col` and `lost_col` are `NULL` (default), `derive_followup()`
looks them up automatically from the field cache (UKB fields 40000 and 191).
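
Whichever columns end up being used, the rule applied is the earliest-date rule listed at the start of this step. The chunk below is a minimal base R sketch of that rule on a hypothetical three-row data frame. It is not the `ukbflow` implementation: the column names (`baseline`, `event`, `death`) and the dates are invented for illustration, and loss to follow-up is left out to mirror `lost_col = FALSE`.

```{r followup-sketch}
# Hypothetical toy data; these column names are illustrative only
# and are not produced by ukbflow.
toy <- data.frame(
  baseline = as.Date(c("2008-03-01", "2009-06-15", "2010-01-10")),
  event    = as.Date(c("2015-03-01", NA, "2009-12-01")),
  death    = as.Date(c(NA, "2018-06-15", NA))
)

# Administrative censoring date, repeated once per participant.
censor <- rep(as.Date("2022-10-31"), nrow(toy))

# Follow-up ends at the earliest available date; missing dates are ignored.
toy$followup_end <- pmin(toy$event, toy$death, censor, na.rm = TRUE)

# Duration in years from baseline. Zero or negative durations
# (event on or before baseline, i.e. prevalent cases) are set to NA.
years <- as.numeric(toy$followup_end - toy$baseline) / 365.25
toy$followup_years <- ifelse(years > 0, years, NA_real_)

toy
```

In this sketch the first participant ends follow-up at the event date, the second at death, and the third receives `NA` because the event precedes baseline, which matches the behaviour described for prevalent cases above.
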
Pass `FALSE` to explicitly disable a competing event:

```{r derive-followup-nodeath}
df <- derive_followup(df,
  name = "dm",
  event_col = "dm_date",
  baseline_col = "p53_i0",
  censor_date = as.Date("2022-10-31"),
  death_col = FALSE,
  lost_col = FALSE
)
```

---

## Full Survival-Ready Pipeline

After completing all three steps, the data contains everything needed to fit a
Cox proportional hazards model:

```{r cox-example}
library(survival)

# Incident analysis: exclude prevalent cases and those with undetermined timing
df_incident <- df[dm_timing != 1L]

fit <- coxph(
  Surv(dm_followup_years, dm_status) ~ p20116_i0 + p21022 + p31 + p1558_i0,
  data = df_incident
)
summary(fit)
```

Column roles in the model:

| Column | Role |
|---|---|
| `dm_status` | Event indicator (logical) |
| `dm_followup_years` | Time variable |
| `dm_timing` | Filter: exclude prevalent (`== 1`) |
| `age_at_dm` | Age at diagnosis (descriptive / secondary analysis) |
| `p20116_i0` | Exposure of interest (smoking status) |

---

## Getting Help

- `?derive_timing`, `?derive_age`, `?derive_followup`
- `vignette("derive")`: disease phenotype derivation
- `vignette("decode")`: decoding column names and values
- [GitHub Issues](https://github.com/evanbio/ukbflow/issues)