--- title: "diagFDR: generic PSM diagnostics from mzIdentML (.mzid)" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{diagFDR: generic PSM diagnostics from mzIdentML (.mzid)} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include=FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 7, fig.height = 4.5 ) ``` This vignette demonstrates how to compute **diagFDR** diagnostics from a generic **mzIdentML** file (`.mzid`). Many search engines can export identifications in mzIdentML format. When explicit q-values and/or PEPs are not present in the mzIdentML, `diagFDR` can: 1. extract a **competed PSM universe** (rank-1 by default), 2. infer target/decoy labels, 3. select a primary numeric score CV term (configurable), 4. reconstruct **TDC q-values** from scores via target--decoy counting (TDC), 5. compute scope/calibration/stability diagnostics. > Note: mzIdentML is flexible and tool-dependent. If the score CV term or decoy labeling > is not encoded consistently, you may need to adjust `score_accession_preference`, > `score_direction`, and/or `decoy_regex`. ## Runnable toy example (no external files required) This section provides a small *toy* dataset that mimics the **PSM-level competed universe** returned by `read_dfdr_mzid()` after choosing a score and reconstructing TDC q-values. ```{r toy} library(diagFDR) set.seed(3) n <- 7000 pi_decoy <- 0.03 # Decoy indicator is_decoy <- sample(c(FALSE, TRUE), n, replace = TRUE, prob = c(1 - pi_decoy, pi_decoy)) # Targets are a mixture: some null-like (close to decoys), some true (higher score) # This yields realistic separation and non-trivial discoveries at 1% FDR. is_true_target <- (!is_decoy) & (runif(n) < 0.30) # 30% of targets are "true" is_null_target <- (!is_decoy) & (!is_true_target) score <- numeric(n) score[is_decoy] <- rnorm(sum(is_decoy), mean = 0.0, sd = 1.0) score[is_null_target] <- rnorm(sum(is_null_target), mean = 0.2, sd = 1.0) score[is_true_target] <- rnorm(sum(is_true_target), mean = 3.0, sd = 1.0) toy <- data.frame( id = paste0("psm", seq_len(n)), is_decoy = is_decoy, run = sample(paste0("run", 1:4), n, replace = TRUE), score = score, pep = NA_real_ ) # Score-based TDC q-values (higher score is better) toy <- toy[order(toy$score, decreasing = TRUE), ] toy$D_cum <- cumsum(toy$is_decoy) toy$T_cum <- cumsum(!toy$is_decoy) toy$FDR_hat <- (toy$D_cum + 1) / pmax(toy$T_cum, 1) toy$q <- rev(cummin(rev(toy$FDR_hat))) toy <- toy[, c("id","is_decoy","q","pep","run","score")] x_toy <- as_dfdr_tbl( toy, unit = "psm", scope = "global", q_source = "toy TDC from score", q_max_export = 1, provenance = list(tool = "toy") ) diag <- dfdr_run_all( xs = list(mzid_PSM = x_toy), alpha_main = 0.01, alphas = c(1e-4, 5e-4, 1e-3, 2e-3, 5e-3, 1e-2, 2e-2, 5e-2, 1e-1, 2e-1), eps = 0.2, win_rel = 0.2, truncation = "warn_drop", low_conf = c(0.2, 0.5) ) ``` ### Headline stability at 1% ```{r headline} diag$tables$headline if (nrow(diag$tables$headline) > 0 && diag$tables$headline$T_alpha[1] == 0) { cat("\nNote: No discoveries at alpha_main for this toy run. ", "For demonstration, consider using alpha_main = 0.02.\n", sep = "") } ``` ### Tail support and stability versus threshold ```{r stability-plots} diag$plots$dalpha diag$plots$cv ``` ### Local boundary support and elasticity ```{r boundary-stability} diag$plots$dwin diag$plots$elasticity ``` ### Equal-chance plausibility by q-band ```{r equal-chance} diag$tables$equal_chance_pooled diag$plots$equal_chance__mzid_PSM ``` ## Real mzIdentML workflow (.mzid) The code below shows how to run **diagFDR** directly from a `.mzid` file. ```{r real-mzid, eval=FALSE} library(diagFDR) mzid_path <- "path/to/search_results.mzid" x_mzid <- read_dfdr_mzid( mzid_path = mzid_path, rank = 1L, # competed universe: take rank-1 SpectrumIdentificationItem # Choose a score CV term (priority list) and interpret its direction score_accession_preference = c( "MS:1002257", # MS-GF:RawScore (higher is better) "MS:1001330", # Mascot:score (higher is better) "MS:1001328", # SEQUEST:xcorr (higher is better) "MS:1002052", "MS:1002049", "MS:1001331", "MS:1001171", "MS:1001950", "MS:1002466" ), score_direction = "auto", # or "higher_better"/"lower_better" if auto fails # TDC correction: FDR_hat = (D + add_decoy)/T add_decoy = 1L, # Strict by default: require score for all PSMs (set <1 to allow missing) min_score_coverage = 1.0, # Fallback decoy inference if PeptideEvidence@isDecoy is not informative decoy_regex = "(^##|_REVERSED$|^REV_|^DECOY_)", unit = "psm", scope = "global", provenance = list(file = basename(mzid_path)) ) diag <- dfdr_run_all( xs = list(mzid_PSM = x_mzid), alpha_main = 0.01, alphas = c(1e-4, 5e-4, 1e-3, 2e-3, 5e-3, 1e-2, 2e-2, 5e-2, 1e-1, 2e-1), eps = 0.2, win_rel = 0.2, truncation = "warn_drop", low_conf = c(0.2, 0.5) ) # Export outputs dfdr_write_report( diag, out_dir = "diagFDR_mzid_out", formats = c("csv", "png", "manifest", "readme", "summary") ) # Render a single HTML report (requires rmarkdown in Suggests) dfdr_render_report(diag, out_dir = "diagFDR_mzid_out") ``` ## Optional: pseudo-p-values from the decoy score tail If you want to run the p-value diagnostics (calibration plots, Storey pi0, BH stability), you can map scores to **pseudo-p-values** using the empirical decoy null tail. ```{r pseudo-p, eval=FALSE} x_mzid$p <- score_to_pvalue( score = x_mzid$score, method = "decoy_ecdf", is_decoy = x_mzid$is_decoy ) attr(x_mzid, "meta")$p_source <- "score_to_pvalue(method='decoy_ecdf' on mzid score)" diag_p <- dfdr_run_all( xs = list(mzid_PSM = x_mzid), alpha_main = 0.01 ) dfdr_write_report( diag_p, out_dir = "diagFDR_mzid_out_with_p", formats = c("csv","png","manifest","readme","summary") ) dfdr_render_report(diag_p, out_dir = "diagFDR_mzid_out_with_p") ``` ## Interpretation notes / common pitfalls - **Score selection matters:** mzIdentML files can contain multiple score CV terms. If `read_dfdr_mzid()` picks an unexpected score, adjust `score_accession_preference` and/or set `score_direction` explicitly. - **Decoy labeling matters:** ideally, mzIdentML includes `PeptideEvidence@isDecoy`. If not, `read_dfdr_mzid()` falls back to `decoy_regex` applied to protein accessions. If your pipeline uses a different decoy naming convention, update `decoy_regex`. - **Missing scores:** if some PSMs are missing the chosen score CV term, you can relax `min_score_coverage` (e.g., 0.95) but then interpret results carefully, as the hypothesis universe changes. - **Scope:** this vignette constructs a PSM-level universe. Peptide- or protein-level claims require additional aggregation/inference.