---
title: "Operational Utilities: Setup, Diagnostics, and Pipeline Tracking"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Operational Utilities: Setup, Diagnostics, and Pipeline Tracking}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  eval = FALSE
)
```

## Overview

The `ops_*` functions are a set of lightweight utilities that sit outside the main analysis pipeline. They help you verify your environment before starting, explore data quality, and track how your cohort changes at each processing step.

| Function | Purpose |
|---|---|
| `ops_setup()` | Check dx CLI, RAP authentication, and R package dependencies |
| `ops_toy()` | Generate synthetic UKB-like data for development and testing |
| `ops_na()` | Summarise missing values (NA and `""`) across all columns |
| `ops_snapshot()` | Record pipeline checkpoints and track dataset changes |

`ops_setup()` may query dx CLI and RAP authentication status as part of its health check. All other functions operate entirely locally: `ops_toy()` and `ops_na()` are read-only; `ops_snapshot()` and its companions track and optionally clean up columns; `ops_withdraw()` removes withdrawn participants in-place. None of them read from or write to RAP storage.

---

## `ops_setup()` — Environment Health Check

Run `ops_setup()` once after installing ukbflow to confirm that all required components are in place before starting a real analysis.

```{r ops-setup}
library(ukbflow)

ops_setup()
#> ── ukbflow environment check ──────────────────────────────────────────────
#> ℹ ukbflow 0.1.0 | R 4.4.1 | 2026-03-09
#> ── 1. dx-toolkit ──────────────────────────────────────────────────────────
#> ✔ dx: /usr/local/bin/dx (dx-toolkit v0.375.0)
#> ── 2. RAP authentication ──────────────────────────────────────────────────
#> ✔ user: evan.zhou
#> ✔ project: project-GXk9...
#> ── 3. R packages ──────────────────────────────────────────────────────────
#> ✔ cli 3.6.3 [core]
#> ✔ data.table 1.15.4 [core]
#> ✔ survival 3.7.0 [assoc_coxph]
#> ✔ forestploter 1.1.1 [plot_forest]
#> ...
#> ───────────────────────────────────────────────────────────────────────────
#> ✔ 15 passed
#> ! 2 optional / warning
```

For programmatic use (e.g. inside scripts or CI), set `verbose = FALSE` and inspect the returned list:

```{r ops-setup-prog}
result <- ops_setup(verbose = FALSE)

result$summary
#> $pass
#> [1] 15
#>
#> $warn
#> [1] 2
#>
#> $fail
#> [1] 0

# Gate the rest of your script on a clean environment
stopifnot(result$summary$fail == 0)
```

Individual checks can be disabled when only a subset is needed:

```{r ops-setup-partial}
# Check R package dependencies only (skip dx and RAP auth)
ops_setup(check_dx = FALSE, check_auth = FALSE)
```

---

## `ops_toy()` — Synthetic UKB Data

`ops_toy()` generates a realistic but entirely synthetic dataset that mimics the structure of UKB phenotype data on the RAP. Use it to develop and test `derive_*`, `assoc_*`, and `plot_*` functions without needing real UKB data access.

### Cohort scenario

The default `"cohort"` scenario produces a wide participant-level table that covers all major UKB data domains:

```{r ops-toy-cohort}
dt <- ops_toy()
#> ✔ ops_toy: 1000 participants | 75 columns | scenario = "cohort" | seed = 42

dim(dt)
#> [1] 1000 75

names(dt)
#> [1] "eid"       "p31"       "p34"       "p53_i0"
#> [5] "p21022"    "p21001_i0" "p20116_i0" "p1558_i0"
#> ...
```

Column groups included:

| Group | Columns |
|---|---|
| Demographics | `eid`, `p31`, `p34`, `p53_i0`, `p21022` |
| Covariates | `p21001_i0`, `p20116_i0`, `p1558_i0`, `p21000_i0`, `p22189`, `p54_i0` |
| Genetic PCs | `p22009_a1` – `p22009_a10` |
| Self-report disease | `p20002_i0_a0` – `a4`, `p20008_i0_a0` – `a4` |
| Self-report cancer | `p20001_i0_a0` – `a4`, `p20006_i0_a0` – `a4` |
| HES | `p41270` (JSON array), `p41280_a0` – `a8` |
| Cancer registry | `p40006_i0` – `i2`, `p40011_i0` – `i2`, `p40012_i0` – `i2`, `p40005_i0` – `i2` |
| Death registry | `p40001_i0`, `p40002_i0_a0` – `a2`, `p40000_i0` |
| First occurrence | `p131742` |
| GRS columns | `grs_bmi`, `grs_raw`, `grs_finngen` |
| Messy columns | `messy_allna`, `messy_empty`, `messy_label` |

The messy columns deliberately stress-test `derive_missing()` and `ops_na()` against common data quality issues (all-NA columns, empty strings, non-standard missing labels).

Feed the output directly into the derive pipeline:

```{r ops-toy-pipeline}
dt <- ops_toy()
dt <- derive_missing(dt)
dt <- derive_covariate(dt,
  as_numeric = "p21001_i0",
  as_factor  = c("p31", "p20116_i0")
)
```

### Forest scenario

The `"forest"` scenario returns a results table matching the output of `assoc_coxph()`, useful for developing and testing `plot_forest()` without running a real Cox model:

```{r ops-toy-forest}
dt_forest <- ops_toy(scenario = "forest")
#> ✔ ops_toy: 24 rows | 11 columns | scenario = "forest" | seed = 42

fully <- dt_forest[model == "Fully adjusted"]
plot_forest(
  data  = fully,
  est   = fully$HR,
  lower = fully$CI_lower,
  upper = fully$CI_upper
)
```

### Reproducibility

Results are reproducible by default (`seed = 42`).
Pass `seed = NULL` for a different dataset on every call:

```{r ops-toy-seed}
dt1 <- ops_toy(seed = 1)
dt2 <- ops_toy(seed = 1)
identical(dt1, dt2)  # TRUE

dt_random <- ops_toy(seed = NULL)  # different every call
```

---

## `ops_na()` — Missing Value Diagnostics

`ops_na()` scans every column for `NA` **and empty strings (`""`)**, returning counts and percentages sorted by missingness. Counting `""` as missing is intentional — UKB exports frequently use empty strings as placeholders for absent text values, so `ops_na()` reports *effective* missingness rather than a plain `is.na()` count.

It is designed to be called before `derive_missing()` to understand the data quality profile of a freshly extracted UKB dataset.

```{r ops-na-basic}
dt <- ops_toy()

ops_na(dt)
#> ── ops_na ──────────────────────────────────────────────────────────────────
#> ℹ 1000 rows | 75 columns | threshold = 0%
#> ✖ messy_allna   1000 / 1000 (100.00%)
#> ✖ p41280_a4     1000 / 1000 (100.00%)
#> ✖ p20002_i0_a4   976 / 1000 ( 97.60%)
#> ✖ p131742        916 / 1000 ( 91.60%)
#> ...
#> ────────────────────────────────────────────────────────────────────────────
#> ✖ 41 columns ≥ 10% missing
#> ✔ 24 columns complete (0% missing)
```

Columns with ≥ 10% missing are flagged in red (`✖`); those between 0% and 10% in yellow (`!`). The summary block (totals) is always printed regardless of the `threshold` setting.

### Controlling CLI output with `threshold`

Use `threshold` to silence low-missingness columns from the per-column listing when the dataset has many columns. The summary block and returned data.table are always complete.
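The effective-missingness rule described earlier (`NA` for any type, plus `""` in character columns) is easy to see in isolation. The sketch below uses only base R and a hypothetical `eff_na()` helper; it illustrates the counting rule, not ukbflow's implementation:

```{r effective-na-sketch}
# Hypothetical helper (illustration only): count NA, plus "" for character columns
toy <- data.frame(
  x = c(1, NA, 3, NA),
  y = c("a", "", "", "d")
)

eff_na <- function(col) {
  if (is.character(col)) sum(is.na(col) | col == "") else sum(is.na(col))
}

vapply(toy, eff_na, integer(1))
# -> x = 2, y = 2
```

A plain `is.na()` count would report 2 missing values in `x` but 0 in `y`; the effective count treats the `""` placeholders in `y` as missing too.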
```{r ops-na-threshold}
# Only list columns with > 50% missing in the console output
ops_na(dt, threshold = 50)

# Suppress all per-column lines — summary only
ops_na(dt, threshold = 100)
```

### Programmatic use

`ops_na()` returns a `data.table` invisibly, regardless of `threshold`:

```{r ops-na-prog}
result <- ops_na(dt, verbose = FALSE)

result
#>         column n_na pct_na
#>
#> 1: messy_allna 1000  100.0
#> 2:   p41280_a4 1000  100.0
#> ...

# Identify columns to drop before modelling
cols_to_drop <- result[pct_na > 90, column]
dt[, (cols_to_drop) := NULL]
```

---

## `ops_snapshot()` — Pipeline Checkpoints

`ops_snapshot()` records a lightweight summary of your dataset at each processing step and stores it in the session cache. Each subsequent call automatically computes deltas (Δ) against the previous snapshot, making it easy to track how rows, columns, and missingness change through the pipeline.

### Recording snapshots

```{r ops-snapshot-record}
dt <- ops_toy()

ops_snapshot(dt, label = "raw")
#> ── snapshot: raw ───────────────────────────────────────────────────────────
#> rows     1,000
#> cols     75
#> NA cols  41
#> size     0.61 MB
#> ────────────────────────────────────────────────────────────────────────────

dt <- derive_missing(dt)
ops_snapshot(dt, label = "after_derive_missing")
#> ── snapshot: after_derive_missing ──────────────────────────────────────────
#> rows     1,000   (= 0)
#> cols     75      (= 0)
#> NA cols  43      (+2)
#> size     0.61 MB (= 0)
#> ────────────────────────────────────────────────────────────────────────────

dt <- dt[p31 == "Female"]
ops_snapshot(dt, label = "female_only")
#> ── snapshot: female_only ───────────────────────────────────────────────────
#> rows     570     (-430)
#> cols     75      (= 0)
#> NA cols  43      (= 0)
#> size     0.36 MB (-0.25 MB)
#> ────────────────────────────────────────────────────────────────────────────
```

When `label` is omitted, snapshots are named `snapshot_1`, `snapshot_2`, etc. automatically.
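The Δ annotations shown above are plain arithmetic over per-checkpoint summaries. A minimal base-R sketch of the idea, using hypothetical helper names (`snap_summary()` and `snap_delta()` are illustrations, not ukbflow internals):

```{r snapshot-delta-sketch}
# Hypothetical helpers (illustration only): summarise a table, then subtract
snap_summary <- function(df) {
  c(
    nrow      = nrow(df),
    ncol      = ncol(df),
    n_na_cols = sum(vapply(df, anyNA, logical(1)))
  )
}
snap_delta <- function(prev, cur) cur - prev

full <- data.frame(eid = 1:1000, x = c(NA, seq_len(999)))
part <- full[1:570, ]

snap_delta(snap_summary(full), snap_summary(part))
# -> nrow -430, ncol 0, n_na_cols 0
```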
Labels should be unique within a session: if the same label is used twice, the history row is appended again but the stored column list is overwritten — which can cause `ops_snapshot_cols()` and `ops_snapshot_diff()` to behave unexpectedly.

### Viewing the full history

Call `ops_snapshot()` with no arguments to print and return the complete history data.table:

```{r ops-snapshot-history}
ops_snapshot()
#> ── ops_snapshot history ────────────────────────────────────────────────────
#>    idx                label timestamp nrow ncol n_na_cols size_mb
#> 1:   1                  raw  14:30:01 1000   75        41    0.61
#> 2:   2 after_derive_missing  14:30:05 1000   75        43    0.61
#> 3:   3          female_only  14:30:08  570   75        43    0.36
#> ────────────────────────────────────────────────────────────────────────────
```

### Silent recording

Set `verbose = FALSE` to record a snapshot without printing anything — useful inside functions or automated scripts:

```{r ops-snapshot-silent}
ops_snapshot(dt, label = "pre_assoc", verbose = FALSE)
```

### Resetting history

```{r ops-snapshot-reset}
ops_snapshot(reset = TRUE)
#> ✔ Snapshot history cleared.
```

> **Session scope**: the snapshot history lives in ukbflow's session cache and
> is cleared when the R session ends or when `ops_snapshot(reset = TRUE)` is
> called. It is not written to disk.

---

## Snapshot Helpers

### `ops_snapshot_cols()` — column names at a checkpoint

Returns the column names recorded at a given snapshot label, minus protected columns (`eid`, `sex`, `age`, `age_at_recruitment`, and any registered via `ops_set_safe_cols()`). The primary use is building a drop vector after the raw columns are no longer needed.
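Conceptually this reduces to a set difference over column-name vectors. A base-R sketch with hypothetical names (the safe list mirrors the defaults described above):

```{r snapshot-cols-sketch}
# Illustration only: droppable columns = snapshot's column names minus the safe list
snapshot_cols <- c("eid", "p31", "p41270", "p40000_i0", "grs_bmi")
safe_cols     <- c("eid", "sex", "age", "age_at_recruitment")

setdiff(snapshot_cols, safe_cols)
# -> "p31" "p41270" "p40000_i0" "grs_bmi"
```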
```{r ops-snapshot-cols}
raw_cols <- ops_snapshot_cols("raw")
# raw_cols is a character vector of droppable column names
```

Pass `keep` to protect additional columns beyond the defaults:

```{r ops-snapshot-cols-keep}
raw_cols <- ops_snapshot_cols("raw", keep = "p53_i0")
```

### `ops_snapshot_diff()` — compare two checkpoints

Returns lists of columns added and removed between two snapshots — useful for auditing what `derive_*` functions produced.

```{r ops-snapshot-diff}
result <- ops_snapshot_diff("raw", "after_derive_missing")
result$added    # columns added in this step
result$removed  # columns dropped in this step
```

### `ops_snapshot_remove()` — drop raw columns after deriving

Removes the raw columns captured at a snapshot from `data`, keeping any derived columns added since. Built-in safe columns (`eid`, etc.) and columns supplied in `keep` are always retained.

```{r ops-snapshot-remove}
# After deriving, drop the original raw columns
dt <- ops_snapshot_remove(dt, from = "raw")
#> ✔ ops_snapshot_remove: dropped 60 raw columns, 15 remaining.
```

For `data.table` input the operation is by reference (in-place); for `data.frame` input a new `data.table` is returned and the original is not modified.

### `ops_set_safe_cols()` — register study-specific protected columns

Adds column names to the session safe list so they are never dropped by `ops_snapshot_cols()` or `ops_snapshot_remove()`. Columns such as `eid` and `age_at_recruitment` are already protected by default, so only genuinely study-specific columns need registering.

```{r ops-set-safe-cols}
ops_set_safe_cols(c("date_baseline", "grs_bmi"))

# Clear registered safe cols
ops_set_safe_cols(reset = TRUE)
```

---

## `ops_withdraw()` — Exclude Withdrawn Participants

UK Biobank periodically issues withdrawal files listing participants who have revoked consent. `ops_withdraw()` reads the headerless single-column CSV supplied by UKB and removes matching rows from your dataset. Two snapshots (`before_withdraw` / `after_withdraw`) are recorded automatically.
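The exclusion itself amounts to filtering out rows whose `eid` appears in the withdrawal list. A base-R sketch of the logic (an illustration, not ukbflow's implementation):

```{r withdraw-sketch}
# Illustration only: drop rows whose eid appears in the withdrawal list
dat          <- data.frame(eid = c(101, 102, 103, 104), x = 1:4)
withdraw_ids <- c(102, 999)  # as read from the headerless one-column CSV

kept <- dat[!dat$eid %in% withdraw_ids, ]
kept$eid
# -> 101 103 104
```

With `data.table`, the same filter can be written as the anti-join `dt[!withdraw, on = "eid"]`.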
```{r ops-withdraw}
dt <- ops_withdraw(dt, file = "w854944_20260310.csv")
#> ── snapshot: before_withdraw ───────────────────────────────────────────────
#> rows     502,492
#> ...
#> ── snapshot: after_withdraw ────────────────────────────────────────────────
#> rows     502,489 (-3)
#> ...
#> ℹ Withdrawal file: w854944_20260310.csv (312 IDs)
#> ✖ Excluded: 3 participants found in data
#> ✔ Remaining: 502,489 participants
```

Run this immediately after loading your extracted dataset, before any `derive_*` steps, so withdrawn participants never enter the analysis.

---

## Typical Workflow

The four `ops_*` functions form a natural bookend around the core pipeline:

```{r ops-workflow}
library(ukbflow)

# 1. Verify environment before starting
ops_setup()

# 2. Generate test data (or extract real data from RAP)
dt <- ops_toy()

# 3. Inspect data quality before processing
ops_na(dt)

# 4. Run pipeline with checkpoints
ops_snapshot(dt, label = "raw")

dt <- derive_missing(dt)
ops_snapshot(dt, label = "after_derive_missing")

dt <- derive_covariate(dt,
  as_numeric = "p21001_i0",
  as_factor  = c("p31", "p20116_i0")
)
ops_snapshot(dt, label = "after_derive_covariate")

# 5. Review full pipeline history
ops_snapshot()
```

---

## Getting Help

- `?ops_setup`, `?ops_toy`, `?ops_na`, `?ops_snapshot`
- `?ops_snapshot_cols`, `?ops_snapshot_diff`, `?ops_snapshot_remove`, `?ops_set_safe_cols`
- `?ops_withdraw`
- `vignette("get-started")` — end-to-end pipeline overview
- `vignette("derive")` — disease phenotype derivation
- [GitHub Issues](https://github.com/evanbio/ukbflow/issues)