--- title: "Introduction to easyLSEA" author: "easyLSEA authors" date: "`r Sys.Date()`" output: rmarkdown::html_vignette: toc: true toc_depth: 3 vignette: > %\VignetteIndexEntry{Introduction to easyLSEA} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", eval = FALSE ) ``` ## Overview **easyLSEA** provides a complete pipeline for Lipid Set Enrichment Analysis (LSEA) in R. Starting from a table of differential lipid abundances, it annotates lipids into biological groups, tests whether those groups are systematically shifted between conditions, and produces publication-ready bubble plots, distribution plots, and fatty acid chain visualizations. The package runs two complementary enrichment engines: - **Kolmogorov–Smirnov (KS)** — tests whether the entire logFC distribution of a lipid set differs from the background. Sensitive to distributed moderate effects across many set members. - **fgsea** — tests whether set members cluster at the extreme ends of the ranked list. Sensitive to a small number of strongly regulated lipids. Running both engines and comparing their results (convergence analysis) gives a more complete picture of lipid remodeling than either method alone. --- ## Installation Install the stable version from CRAN: ```{r install-cran} install.packages("easyLSEA") ``` Or the development version from GitHub: ```{r install-github} # install.packages("remotes") remotes::install_github("DavidGO464/easyLSEA") ``` To enable the fgsea engine, install the optional Bioconductor dependency: ```{r install-fgsea} BiocManager::install("fgsea") ``` --- ## Input data `easyLSEA` expects a `data.frame` with at least: | Column | Description | |--------|-------------| | `LipidName` | Lipid identifier in standard shorthand notation (e.g. `PC 36:4`, `TG 54:3`) | | `logFC` | log2 fold-change (case vs reference) | | `P.Value` | Raw (unadjusted) p-value — used for the fgsea pi-value rank metric | Additional columns are used when present: | Column | Used for | |--------|----------| | `adj.P.Val` | Counting significantly changed lipids | | `Confidence_rank` | Annotation confidence filtering | | `Shorthand` | Alternative lipid name fallback | Column names are configurable via `lipid_col`, `fc_col`, and `pval_col` arguments — only the defaults are shown above. --- ## Quick start The entire pipeline runs in a single call to `easyLSEA()`: ```{r quickstart} library(easyLSEA) result <- easyLSEA( data = my_lipid_data, lipid_col = "LipidName", fc_col = "logFC", pval_col = "P.Value", case_lbl = "NASH", ref_lbl = "Control", engine = "both", # run KS and fgsea min_rank = "E" # include all confidence ranks except P and NA (default) ) ``` This returns an `easyLSEA_result` object with five slots described in the next section. --- ## Understanding the output ### `result$meta` A named list with run metadata: date, comparison labels, engine used, number of lipids, and the original function call. ```{r meta} result$meta$date result$meta$case_lbl result$meta$n_lipids ``` ### `result$lsea` Contains the enrichment statistics. The key sub-elements are: ```{r lsea} # KS results — one row per lipid set per grouping level head(result$lsea$ks) # fgsea results head(result$lsea$fgsea) # Combined table with Convergence column head(result$lsea$combined) ``` **Key columns in the KS results:** | Column | Description | |--------|-------------| | `Group` | Lipid set name (e.g. `PC`, `Glycerolipids`) | | `Level` | Grouping level (`LipidClass`, `LipidCategory_LMAPS`, `LipidCategory_functional`) | | `N_group` | Number of lipids in the set | | `DirectionalScore` | Standardized mean difference (Cohen's *d* analog). Positive = up in case. | | `KS_pval` | Two-sided KS p-value | | `FDR_LSEA` | BH-adjusted FDR | | `DS_perm_pval` | Permutation p-value for the DirectionalScore | | `ContributingLipids_KS` | Lipids on the enriched side of the CDF divergence point | **Key columns in the fgsea results:** | Column | Description | |--------|-------------| | `NES` | Normalized Enrichment Score. Positive = enriched toward top of ranked list (up in case). | | `FDR_fgsea` | BH-adjusted FDR from fgsea | | `N_leading` | Leading edge size | | `LeadingEdge` | Lipids in the leading edge | | `rank_metric` | Rank metric used (`pi_value`, `logFC`, or `t_stat`) | **The `Convergence` column** in the combined table classifies each set as: | Value | Meaning | |-------|---------| | `KS+fgsea [strongest]` | Significant by both engines — highest confidence | | `KS only [distributed effect]` | Moderate shift across many lipids | | `fgsea only [extreme-driven]` | A few strongly regulated lipids drive the signal | | `Neither` | Not significant by either engine | ### `result$chains` Fatty acid chain analysis results, available when `run_chains = TRUE`: ```{r chains} # Long format — one row per acyl chain per lipid head(result$chains$parsed) # Parsing status — one row per lipid head(result$chains$summary) # Wide format — one row per lipid with sn positions and totals head(result$chains$wide) ``` The `wide` table is the most convenient for reporting. Each row is one lipid, with columns `sn1`, `sn2`, `sn3`, `sn4` containing the individual acyl chain positions (e.g. `"18:1"`), and `total_carbons` / `total_unsat` with the summed totals. The `chain_type` column clarifies how to interpret the sn columns: | `chain_type` | `sn1` | `sn2` | `sn3` | `sn4` | |---|---|---|---|---| | `sn2` (PC, PE, PS...) | sn-1 chain | sn-2 chain | NA | NA | | `nacyl` (Cer, SM...) | sphingoid base | N-acyl chain | NA | NA | | `long_format` (TG) | chain 1 | chain 2 | chain 3 | NA | | `long_format` (CL) | chain 1 | chain 2 | chain 3 | chain 4 | | `single` (CAR, LPC) | the chain | NA | NA | NA | ### `result$plots` A named list of `ggplot2` objects, available when `plots = TRUE`: ```{r plots-list} # List all available plots names(result$plots$lsea) names(result$plots$chains) ``` Plot naming convention for LSEA bubble plots: | Name pattern | Description | |-------------|-------------| | `bubble_ks_01_Class` | KS bubble plot — lipid class level | | `bubble_ks_sig_01_Class` | KS bubble plot — significant sets only | | `bubble_fgsea_01_Class` | fgsea bubble plot — lipid class level | | `bubble_fgsea_sig_01_Class` | fgsea bubble plot — significant sets only | | `dist_01_Class` | Distribution (boxplot) — lipid class level | Levels: `01_Class` (lipid class), `02_LMAPS` (LIPID MAPS category), `03_Functional` (functional category). ### `result$input` The annotated input data and the grouping columns used: ```{r input} # Annotated data with LipidClass, LipidCategory_LMAPS, etc. head(result$input$data) # Grouping columns tested result$input$group_cols ``` --- ## Viewing plots Individual plots can be displayed directly: ```{r view-plots} # KS bubble plot — all lipid classes result$plots$lsea$bubble_ks_01_Class # fgsea bubble plot — significant sets only result$plots$lsea$bubble_fgsea_sig_01_Class # Distribution plot — lipid class level result$plots$lsea$dist_01_Class ``` To customize bubble labels: ```{r bubble-label} # Regenerate plots showing only FDR and n plots <- plot_lsea( result$lsea, case_lbl = "NASH", ref_lbl = "Control", bubble_label = c("FDR", "n") ) ``` --- ## Exporting results `export_lsea()` saves all results to a timestamped folder. Supply the output directory explicitly via `dir` (here a temporary directory; for a real analysis use a folder of your choice): ```{r export} export_lsea( result, dir = tempdir(), format = c("csv", "excel", "pdf") ) ``` This creates: ``` easyLSEA_NASH_vs_Control_2024-01-15_1430/ tables/ lsea_results_ks.csv lsea_results_fgsea.csv lsea_combined.csv chain_results.csv plots/ lsea/ 01_Class/ bubble_ks_01_Class.pdf bubble_ks_sig_01_Class.pdf bubble_fgsea_01_Class.pdf bubble_fgsea_sig_01_Class.pdf dist_01_Class.pdf 02_LMAPS/ ... 03_Functional/ ... chains/ tile/ ... trend/ ... results.xlsx ``` --- ## Advanced usage ### Running engines separately ```{r advanced-separate} # Step 1: annotate annotated <- annotate_lipids(my_lipid_data, lipid_col = "LipidName") # Step 2: run enrichment lsea_res <- run_lsea( data = annotated, fc_col = "logFC", engine = "both", case_lbl = "NASH", ref_lbl = "Control" ) # Step 3: generate plots manually plots <- plot_lsea( lsea_res, case_lbl = "NASH", ref_lbl = "Control", fdr_thresh = 0.05, bubble_label = c("FDR", "DS", "NES", "n") ) # Step 4: distribution plot for a specific level p_dist <- plot_distribution( data = annotated, lsea_result = lsea_res, group_col = "LipidClass", case_lbl = "NASH", ref_lbl = "Control" ) ``` ### Changing the fgsea rank metric ```{r fgsea-rank} # Default: pi-value = sign(logFC) * -log10(P.Value) # Combines effect size and statistical evidence # Alternative: logFC only result_fc <- easyLSEA( data = my_lipid_data, engine = "fgsea", fgsea_rank = "logFC" ) # Alternative: LIMMA t-statistic (requires a 't' column) result_t <- easyLSEA( data = my_lipid_data, engine = "fgsea", fgsea_rank = "t_stat" ) ``` ### Adjusting significance thresholds ```{r thresholds} result <- easyLSEA( data = my_lipid_data, min_n = 5L, # require at least 5 lipids per set n_perm = 5000L, # more permutations for DS_perm_pval fgsea_nperm = 20000L # more permutations for fgsea ) ``` ### Filtering by annotation confidence rank When lipid annotations include a confidence rank (A > B > C > D > E > P), `min_rank` controls which lipids enter the chain analysis: ```{r min-rank} # Default: include all except P and NA result_all <- easyLSEA(data = my_lipid_data, min_rank = "E") # Strict: include only high-confidence annotations (A and B) result_strict <- easyLSEA(data = my_lipid_data, min_rank = "B") # Or apply directly to parse_lipid_chains chains_strict <- parse_lipid_chains(annotated, min_rank = "B") table(chains_strict$summary$status) ``` Lipids excluded by `min_rank` appear in `result$chains$summary` with status `excluded_rank_below_X` (where X is the chosen threshold), making it easy to audit which lipids were filtered. --- ## Interpreting results ### KS vs fgsea — which to trust? Both engines test the same null hypothesis (no systematic enrichment) but are sensitive to different signal patterns: - **KS** is powerful when many lipids in a class shift moderately in the same direction. A significant KS result with `DirectionalScore ≈ 0` suggests a variance or shape difference rather than a net directional shift — interpret with caution. - **fgsea** is powerful when a small number of lipids drive an extreme signal. A significant fgsea result without KS support suggests the enrichment is carried by a few lipids rather than the class as a whole. **Convergent results** (significant by both) provide the strongest evidence for coordinated lipid class remodeling. ### The DirectionalScore The `DirectionalScore` is a standardized mean difference (Cohen's *d* analog): ``` DS = (mean_logFC_set - mean_logFC_background) / SD_pooled ``` It quantifies the **direction and magnitude** of the shift, independent of the KS p-value. A set can have a significant KS p-value (distributional difference) with a near-zero DirectionalScore (no net direction) — this pattern suggests heterogeneous within-set regulation. ### The pi-value rank metric By default, fgsea ranks lipids by: ``` pi-value = sign(logFC) × −log10(P.Value) ``` This combines the direction of change with statistical confidence, giving more weight to lipids that are both strongly and significantly regulated. It is preferred over logFC alone when p-values are available. --- ## Session info ```{r session, eval = TRUE} sessionInfo() ```