---
title: "Genetic Risk Score (GRS) Analysis on UKB RAP"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Genetic Risk Score (GRS) Analysis on UKB RAP}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  eval = FALSE
)
```

## Overview

The `grs_*` functions provide an end-to-end pipeline for computing and
validating polygenic / genetic risk scores within the UK Biobank Research
Analysis Platform (RAP). Because individual-level genotype data cannot be
downloaded locally, all computationally intensive steps are executed as
cloud jobs via the DNAnexus Swiss Army Knife app.

| Function | Where it runs | Purpose |
|---|---|---|
| `grs_check()` | Local or RAP | Validate and export a SNP weights file |
| `grs_bgen2pgen()` | Submits RAP jobs | Convert UKB imputed BGEN files to PGEN |
| `grs_score()` | Submits RAP jobs | Score genotypes across chromosomes with plink2 |
| `grs_standardize()` | Local or RAP | Z-score standardise GRS columns |
| `grs_zscore()` | Local or RAP | Alias for `grs_standardize()` |
| `grs_validate()` | Local or RAP | Assess predictive performance (OR/HR, AUC, C-index) |

**Typical pipeline:**

```
grs_check() -> grs_bgen2pgen() -> grs_score() -> grs_standardize() -> grs_validate()
```

> **Prerequisite**: you must be authenticated to UKB RAP and have a project
> selected. See `vignette("auth")`.

---

## Step 1: Validate the Weights File -- `grs_check()`

Before uploading to RAP, validate your SNP weights file. The function reads
the file, checks required columns, flags problems, and writes a
plink2-compatible space-delimited output.
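To make the validation logic concrete, here is a minimal base-R sketch of the same checks on a toy weights table. This is illustrative only: the column names and regexes mirror the requirements documented for `grs_check()`, but the function itself performs these checks (and more) with detailed reporting, so you would not normally write this yourself.

```{r grs-check-sketch}
# Toy weights table with the three required columns (illustrative data)
w <- data.frame(
  snp           = c("rs123", "rs456", "rs789"),
  effect_allele = c("A", "T", "G"),
  beta          = c(0.12, -0.05, 0.31)
)

# The core checks that a weights file must pass:
stopifnot(
  !anyNA(w),                                        # no missing values
  !anyDuplicated(w$snp),                            # no duplicate SNPs
  all(grepl("^rs[0-9]+$", w$snp)),                  # rs-number format
  all(w$effect_allele %in% c("A", "T", "C", "G")),  # valid effect alleles
  is.numeric(w$beta)                                # numeric effect sizes
)

# plink2 scoring expects a space-delimited file: ID, allele, weight
write.table(w, "weights_clean.txt", sep = " ",
            quote = FALSE, row.names = FALSE)
```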
**Required columns:**

| Column | Description |
|---|---|
| `snp` | SNP identifier (expected `rs` + digits format) |
| `effect_allele` | Effect allele: one of A / T / C / G |
| `beta` | Effect size (log-OR or beta); must be numeric |

```{r grs-check-local}
library(ukbflow)

# Local: weights file on your machine
w <- grs_check("weights.csv", dest = "weights_clean.txt")
#> Read 'weights.csv': 312 rows, 5 columns.
#> ✔ No NA values.
#> ✔ No duplicate SNPs.
#> ✔ All SNP IDs match rs[0-9]+ format.
#> ✔ All effect alleles are A/T/C/G.
#> Beta summary:
#>   Range      : -0.412 to 0.589
#>   Mean |beta|: 0.183
#>   Positive   : 187 (59.9%)
#>   Negative   : 125 (40.1%)
#>   Zero       : 0
#> ✔ Weights file passed checks: 312 SNPs ready for UKB RAP.
#> ✔ Saved: 'weights_clean.txt'
```

```{r grs-check-rap}
# On RAP (RStudio) -- use /mnt/project/ paths directly
w <- grs_check(
  file = "/mnt/project/weights/weights.csv",
  dest = "/mnt/project/weights/weights_clean.txt"
)
```

The validated weights are also returned invisibly for inspection.

---

## Step 2: Convert BGEN to PGEN -- `grs_bgen2pgen()`

UKB imputed genotype data is stored in BGEN format on RAP. `grs_bgen2pgen()`
submits one Swiss Army Knife job per chromosome to convert BGEN files to the
faster PGEN format, with a MAF filter applied via plink2. This function is
called from your local machine or RAP RStudio -- the heavy lifting runs
entirely in the cloud.

```{r grs-bgen2pgen-test}
# Test on the smallest chromosome first
ids <- grs_bgen2pgen(chr = 22, dest = "/pgen/", priority = "high")
#> Uploading 'grs_bgen2pgen_std.R' to RAP ...
#> ✔ Uploaded: '/grs_bgen2pgen_std.R'
#> Submitting 1 job(s) -- mem2_ssd1_v2_x4 / priority: high
#> ✔ chr22 -> 'job-Gxxxxxxxxxxxxxxxxxxxxxxxx'
#> ✔ 1/1 job(s) submitted. Monitor with job_ls().
```

```{r grs-bgen2pgen-all}
# Full run: small chromosomes on standard, large on upgraded instance
ids_small <- grs_bgen2pgen(chr = 17:22, dest = "/pgen/")
ids_large <- grs_bgen2pgen(chr = 1:16, dest = "/pgen/", instance = "large")

# Monitor progress (job_wait() takes one job ID at a time)
job_ls()
for (id in c(ids_small, ids_large)) job_wait(id)
```

**Instance types:**

| Preset | DNAnexus instance | Cores | RAM | SSD | Recommended for |
|---|---|---|---|---|---|
| `"standard"` | `mem2_ssd1_v2_x4` | 4 | 12 GB | 200 GB | chr 15–22 |
| `"large"` | `mem2_ssd2_v2_x8` | 8 | 28 GB | 640 GB | chr 1–16 |

**Key arguments:**

| Argument | Default | Description |
|---|---|---|
| `chr` | `1:22` | Chromosomes to process |
| `dest` | — | RAP output path for PGEN files (required) |
| `maf` | `0.01` | MAF filter passed to plink2 `--maf` |
| `instance` | `"standard"` | Instance type preset |
| `priority` | `"low"` | Job priority (`"low"` or `"high"`) |

> **Storage warning**: chromosomes 1–16 may exceed the 200 GB SSD on
> `"standard"` instances. Use `instance = "large"` for these.

---

## Step 3: Calculate GRS Scores -- `grs_score()`

Once PGEN files are ready, `grs_score()` uploads your weights file(s) and
submits one plink2 scoring job per GRS. Each job scores all 22 chromosomes
and saves a single CSV to RAP.

```{r grs-score}
ids <- grs_score(
  file = c(
    grs_a = "weights/grs_a_weights.txt",
    grs_b = "weights/grs_b_weights.txt"
  ),
  pgen_dir = "/mnt/project/pgen",
  dest = "/grs/",
  priority = "high"
)
#> ── Uploading 2 weight file(s) to RAP ────────────────────────────────────────
#> Uploading 'grs_a_weights.txt' ...
#> ✔ Uploaded: '/grs_a_weights.txt'
#> Uploading 'grs_b_weights.txt' ...
#> ✔ Uploaded: '/grs_b_weights.txt'
#> Submitting 2 job(s) -- mem2_ssd1_v2_x4 / priority: high
#> ✔ grs_a -> 'job-Gxxxxxxxxxxxxxxxxxxxxxxxx'
#> ✔ grs_b -> 'job-Gyyyyyyyyyyyyyyyyyyyyyyyy'
#> ✔ 2/2 job(s) submitted. Monitor with job_ls().
job_wait(ids)
```

When running from RAP RStudio with weights already at the project root, the
upload step is skipped automatically:

```{r grs-score-rap}
# On RAP: weights already at /mnt/project/grs_a_weights.txt
ids <- grs_score(
  file = c(grs_a = "/mnt/project/grs_a_weights.txt"),
  pgen_dir = "/mnt/project/pgen",
  dest = "/grs/"
)
#> ℹ grs_a_weights.txt already at RAP root, skipping upload.
```

> **Important**: the `maf` argument must match the value used in
> `grs_bgen2pgen()` so that the correct PGEN files are located.

Output per job: `/grs/<name>_scores.csv` with columns `IID` and the GRS
score named `GRS_<name>`.

---

## Step 4: Standardise GRS Columns -- `grs_standardize()`

After downloading the score CSVs from RAP (via `fetch_file()`) and merging
them into your analysis dataset, Z-score standardise each GRS column so that
effect estimates are interpretable as per-SD associations.

```{r grs-standardize}
# Auto-detect all columns containing "grs" (case-insensitive)
cohort <- grs_standardize(cohort)
#> Auto-detected 2 GRS column(s): 'GRS_a', 'GRS_b'
#> ✔ GRS_a -> GRS_a_z [mean=0.0000, sd=1.0000]
#> ✔ GRS_b -> GRS_b_z [mean=0.0000, sd=1.0000]
```

```{r grs-standardize-explicit}
# Or specify columns explicitly
cohort <- grs_standardize(cohort, grs_cols = c("GRS_a", "GRS_b"))
```

`grs_zscore()` is an alias and produces an identical result. The original
columns are preserved; `_z` columns are inserted immediately after their
source column.
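The underlying transform is simply `(x - mean(x)) / sd(x)`. If you need to standardise manually, for example within a subgroup, the equivalent base-R operation on a toy data frame looks like this (illustrative data; `grs_standardize()` handles column detection and placement for you):

```{r grs-standardize-manual}
# Toy stand-in for the merged cohort (illustrative data)
set.seed(1)
cohort <- data.frame(IID = 1:1000, GRS_a = rnorm(1000, mean = 2, sd = 0.5))

# Manual Z-score: subtract the mean, divide by the SD
cohort$GRS_a_z <- (cohort$GRS_a - mean(cohort$GRS_a, na.rm = TRUE)) /
  sd(cohort$GRS_a, na.rm = TRUE)

# The standardised column has mean 0 and SD 1 by construction
round(c(mean = mean(cohort$GRS_a_z), sd = sd(cohort$GRS_a_z)), 4)
```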
---

## Step 5: Validate Predictive Performance -- `grs_validate()`

`grs_validate()` runs four complementary validation analyses for each GRS:

| Analysis | What it measures |
|---|---|
| **Per SD** | OR (logistic) or HR (Cox) per 1-SD increase in GRS |
| **High vs Low** | OR / HR comparing top 20% vs bottom 20% |
| **Trend test** | P-trend across Q1–Q4 quartiles |
| **Discrimination** | AUC (logistic) or C-index (Cox) with 95% CI |

### Logistic (cross-sectional)

```{r grs-validate-logistic}
res <- grs_validate(
  data = cohort,
  grs_cols = c("GRS_a_z", "GRS_b_z"),
  outcome_col = "outcome"
)
#> ── Creating GRS groups ──────────────────────────────────────────────────────
#> ── Effect per SD (OR) ───────────────────────────────────────────────────────
#> ── High vs Low ──────────────────────────────────────────────────────────────
#> ── Trend test ───────────────────────────────────────────────────────────────
#> ── AUC ──────────────────────────────────────────────────────────────────────
#> ✔ Validation complete.

res$per_sd
res$discrimination
```

### Cox (survival)

```{r grs-validate-cox}
res <- grs_validate(
  data = cohort,
  grs_cols = c("GRS_a_z", "GRS_b_z"),
  outcome_col = "outcome",
  time_col = "followup_years",
  covariates = c("age", "sex", paste0("pc", 1:10))
)

res$per_sd         # HR per SD x model
res$high_vs_low    # HR: top 20% vs bottom 20%
res$trend          # p-trend across Q1–Q4
res$discrimination # C-index with 95% CI
```

**Return value:** a named list with four `data.table` elements.

| Element | Columns (logistic) | Columns (Cox) |
|---|---|---|
| `per_sd` | `exposure`, `term`, `model`, `OR`, `CI_lower`, `CI_upper`, `p_value` | same with `HR` |
| `high_vs_low` | same as `per_sd` | same with `HR` |
| `trend` | `exposure`, `model`, `p_trend` | same |
| `discrimination` | `GRS`, `AUC`, `CI_lower`, `CI_upper` | `GRS`, `C_index`, `CI_lower`, `CI_upper` |

> AUC calculation requires the `pROC` package.
> Install with `install.packages("pROC")` if needed.
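For intuition about the "per SD" estimate: it corresponds to exponentiating the GRS coefficient from a logistic regression of the outcome on the standardised score. A self-contained sketch on simulated data (note that `grs_validate()` additionally fits covariate-adjusted models, group comparisons, and Cox models when `time_col` is supplied):

```{r grs-per-sd-sketch}
# Simulate a standardised GRS with a true log-OR of 0.4 per SD
set.seed(42)
n       <- 5000
grs_z   <- rnorm(n)
outcome <- rbinom(n, 1, plogis(-2 + 0.4 * grs_z))

fit <- glm(outcome ~ grs_z, family = binomial)

# OR per 1-SD increase with a Wald 95% CI
est <- coef(summary(fit))["grs_z", ]
or  <- exp(est["Estimate"])
ci  <- exp(est["Estimate"] + c(-1.96, 1.96) * est["Std. Error"])
round(c(OR = or, lower = ci[1], upper = ci[2]), 3)
```

The estimated OR should land near `exp(0.4) ≈ 1.49`, with the CI width shrinking as `n` grows.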
---

## Complete Pipeline Example

```{r grs-full-pipeline}
library(ukbflow)

# 1. Validate weights (local)
grs_check("weights.csv", dest = "weights_clean.txt")

# 2. Convert BGEN -> PGEN on RAP (submit jobs)
ids_std <- grs_bgen2pgen(chr = 17:22, dest = "/pgen/", maf = 0.01)
ids_lrg <- grs_bgen2pgen(chr = 1:16, dest = "/pgen/", maf = 0.01,
                         instance = "large")
for (id in c(ids_std, ids_lrg)) job_wait(id)

# 3. Score GRS on RAP (submit jobs)
score_ids <- grs_score(
  file = c(grs_a = "weights_clean.txt"),
  pgen_dir = "/mnt/project/pgen",
  maf = 0.01,  # must match grs_bgen2pgen()
  dest = "/grs/"
)
job_wait(score_ids)

# 4. Download score CSV from RAP
fetch_file("/grs/GRS_a_scores.csv", dest = "GRS_a_scores.csv")

# 5. Merge into analysis dataset and standardise
# cohort: your analysis data.table with IID and phenotype columns
scores <- data.table::fread("GRS_a_scores.csv")  # columns: IID, GRS_a
cohort <- scores[cohort, on = "IID"]  # right-join: keep all cohort rows
cohort <- grs_standardize(cohort, grs_cols = "GRS_a")

# 6. Validate
res <- grs_validate(
  data = cohort,
  grs_cols = "GRS_a_z",
  outcome_col = "outcome",
  time_col = "followup_years",
  covariates = c("age", "sex", paste0("pc", 1:10))
)
res$per_sd
res$discrimination
```

---

## Getting Help

- `?grs_check`, `?grs_bgen2pgen`, `?grs_score`
- `?grs_standardize`, `?grs_validate`
- `vignette("auth")` -- RAP authentication and project selection
- `vignette("job")` -- monitoring and managing RAP jobs
- `vignette("assoc")` -- association analysis functions used inside `grs_validate()`
- [GitHub Issues](https://github.com/evanbio/ukbflow/issues)