The sffdr package implements the surrogate functional false discovery
rate (sfFDR) procedure. This methodology integrates GWAS summary
statistics from related traits (i.e., pleiotropy) to increase
statistical power within the functional FDR framework. The inputs to
sffdr are a set of p-values from a GWAS of interest and a
set of p-values from one or more informative GWAS.
The significance quantities estimated by sffdr (functional
p-values, q-values, and local FDRs) can be used for a variety of
downstream analyses.
The methods implemented in this package are described in:
Bass AJ, Wallace C. Exploiting pleiotropy to enhance variant discovery with functional false discovery rates. Nature Computational Science, 2025.
Note that this work is an extension of the functional FDR methodology
and the software builds on some of the functions in the
fFDR package found at https://github.com/StoreyLab/fFDR.
Please report any bugs or usage issues on GitHub at https://github.com/ajbass/sffdr.
You can install the development version of sffdr from
GitHub:
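One common way to install from GitHub is via the remotes package (shown as a sketch; devtools::install_github works equally well):

``` r
# install.packages("remotes")  # if not already installed
remotes::install_github("ajbass/sffdr")
```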
To demonstrate the package, we will use a sample dataset containing 10,000 SNPs for body mass index (BMI) as the primary trait of interest, with body fat percentage (BFP), cholesterol, and triglycerides as informative traits.
library(sffdr)
data(bmi)
# Define primary p-values and informative p-values
p <- sumstats$bmi
z <- as.matrix(sumstats[,-1])
head(sumstats)
#> bmi bfp cholesterol triglycerides
#> 1 0.4959680 0.313475 0.0394766 0.148379
#> 2 0.6937320 0.565400 0.8986590 0.298349
#> 3 0.0843257 0.447650 0.7301960 0.715846
#> 4 0.2623540 0.911920 0.1753410 0.775170
#> 5 0.2150660 0.668330 0.0526082 0.603160
#> 6 0.9501450 0.607349 0.2759850 0.867493

The first step is to model the relationship between the functional
proportion of null tests and the informative traits. We use
pi0_model to create the model and fpi0est to
estimate \(\pi_{0}(z)\).
# Create model
mpi0 <- pi0_model(z)
#> ==================================================
#> Building Pi0 Model (Rank: Signal -> 0.0)
#> Variables : 3
#> Min Discoveries : 150 (Raw SNPs)
#> --------------------------------------------------
#> [Auto-Scale]: N = 10,000. Bandwidth set to 200 independent SNPs per knot.
#> [bfp]: Found 220 discoveries (FDR < 0.25). Calculating knots...
#> [bfp]: Final Model -> Spline (2 knots: 0.022, 0.1)
#> [cholesterol]: Fallback -> 1 Knot at 0.1000
#> [triglycerides]: Found 161 discoveries (FDR < 0.25). Calculating knots...
#> [triglycerides]: Final Model -> Spline (2 knots: 0.0161, 0.1)
#> --------------------------------------------------
#> ~splines::ns(bfp, knots = c(0.022, 0.1), Boundary.knots = c(0, 1)) + splines::ns(cholesterol, knots = c(0.1), Boundary.knots = c(0, 1)) + splines::ns(triglycerides, knots = c(0.0161, 0.1), Boundary.knots = c(0, 1))
#> ==================================================
# Estimation of functional pi0
fpi0 <- fpi0est(p, mpi0)
#> ==================================================
#> Estimating Functional Pi0
#> ==================================================
#> Fitting models across 19 lambda values...
#>   |==================================================| 100%
#> Selecting optimal lambda via MISE...
#> Selected lambda: 0.25
#> Generating predictions...
#> Done.
#> ==================================================
Note on knots: Ideally, knots should cover the
distribution of non-null p-values in your informative studies. Our
algorithm uses an FDR threshold to choose the location and number of
knots. Depending on the study, you may want to adjust these values (see
?pi0_model). Note that this toy example contains only
10,000 p-values, so the knot locations are not optimal: the
function was designed for GWAS-scale data.
With the functional \(\pi_{0}\) estimated, we can now calculate the FDR quantities and functional p-values.
# Apply sfFDR
sffdr_out <- sffdr(p, fpi0 = fpi0$fpi0)
#> ==================================================
#> Running sfFDR Analysis (v1.1.2)
#> ==================================================
#> Transforming surrogate variable...
#> Fitting kernel density...
#> High-Res Tcub (nn=0.05, h=0.10772)... ✓
#> Enforcing conditional monotonicity using the 'min' method...
#> Dividing data into 10 conditional surrogate bins...
#> Computing functional local FDRs...
#> Computing functional p-values and q-values...
#> Done.
#> ==================================================
# Extract results
results <- data.frame(
p_value = p,
fp_value = sffdr_out$fpvalues,
fq_value = sffdr_out$fqvalues,
flfdr = sffdr_out$flfdr
)
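As one example of using these quantities, discoveries can be called by thresholding the functional q-values (the 5% cutoff below is illustrative, not a recommendation):

``` r
# Count SNPs significant at a functional FDR of 5%
sum(results$fq_value <= 0.05)
```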
# Plot significance results (focusing on high-significance SNPs)
plot(sffdr_out, rng = c(0, 1e-6))
#> Scale for x is already present.
#> Adding another scale for x, which will replace the existing scale.

In standard GWAS, SNPs are often correlated due to linkage
disequilibrium (LD). To account for this, you should provide a logical
vector indicating which SNPs are LD-independent (e.g., obtained via
pruning). sffdr fits the model using these independent
SNPs and then computes the significance quantities for the entire
dataset.
# logical vector: TRUE if SNP is LD-independent
# (In this example, all SNPs are already independent)
indep_snps <- rep(TRUE, length(p))
# Fit model using LD-independent subset
mpi0 <- pi0_model(z = z, indep_snps = indep_snps)
#> ==================================================
#> Building Pi0 Model (Rank: Signal -> 0.0)
#> Variables : 3
#> Min Discoveries : 150 (Effective Independent)
#> --------------------------------------------------
#> [Auto-Scale]: N = 10,000. Bandwidth set to 200 independent SNPs per knot.
#> [bfp]: Found 220 independent discoveries (FDR < 0.25). Calculating knots...
#> [bfp]: Final Model -> Spline (2 knots: 0.022, 0.1)
#> [cholesterol]: Fallback -> 1 Knot at 0.1000
#> [triglycerides]: Found 161 independent discoveries (FDR < 0.25). Calculating knots...
#> [triglycerides]: Final Model -> Spline (2 knots: 0.0161, 0.1)
#> --------------------------------------------------
#> ~splines::ns(bfp, knots = c(0.022, 0.1), Boundary.knots = c(0, 1)) + splines::ns(cholesterol, knots = c(0.1), Boundary.knots = c(0, 1)) + splines::ns(triglycerides, knots = c(0.0161, 0.1), Boundary.knots = c(0, 1))
#> ==================================================
fpi0 <- fpi0est(p, mpi0, indep_snps = indep_snps)
#> ==================================================
#> Estimating Functional Pi0
#> ==================================================
#> Fitting models across 19 lambda values...
#>   |==================================================| 100%
#> Selecting optimal lambda via MISE...
#> Selected lambda: 0.25
#> Generating predictions...
#> Done.
#> ==================================================
#>
# Estimate quantities for all SNPs
sffdr_out <- sffdr(p, fpi0 = fpi0$fpi0)
#> ==================================================
#> Running sfFDR Analysis (v1.1.2)
#> ==================================================
#> Transforming surrogate variable...
#> Fitting kernel density...
#> High-Res Tcub (nn=0.05, h=0.10772)... ✓
#> Enforcing conditional monotonicity using the 'min' method...
#> Dividing data into 10 conditional surrogate bins...
#> Computing functional local FDRs...
#> Computing functional p-values and q-values...
#> Done.
#> ==================================================

Note on choosing independent SNPs: There are two
strategies: (1) prune with plink (e.g.,
--indep-pairwise 200kb 1 0.3) and use the pruned set of
SNPs as the independent set, or (2) use informed sampling based on the
informative traits (e.g., clumping, or selecting the SNP with the
smallest informative p-value as the representative of each LD block).
The latter is likely better when the primary study has low power.
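As a sketch of strategy (1), the pruned set that plink writes to its .prune.in file can be turned into the logical indicator expected by pi0_model and fpi0est. The SNP IDs and file contents below are placeholders for your own data:

``` r
# Convert plink's pruned SNP list into a logical LD-independence indicator.
# In practice: pruned <- readLines("plink.prune.in")
snp_ids <- c("rs1", "rs2", "rs3", "rs4")  # SNPs in the same order as p
pruned  <- c("rs1", "rs3")                # SNPs retained after pruning
indep_snps <- snp_ids %in% pruned
indep_snps
#> [1]  TRUE FALSE  TRUE FALSE
```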
The kernel density estimator is adaptive. However, artifacts such as LD can sometimes induce spurious “bumps” in the null region that distort the density estimate.
To stabilize the joint density, you can enforce a monotone constraint within surrogate bins (targeting roughly 1,000 independent SNPs per bin).
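A minimal sketch of enabling the constraint is below; the argument name monotone.window is our assumption, not a confirmed part of the API, so check ?sffdr for the actual option:

``` r
# Sketch: enable the within-bin monotone constraint
# (`monotone.window` is a hypothetical argument name -- see ?sffdr)
sffdr_out <- sffdr(p, fpi0 = fpi0$fpi0, monotone.window = 1000)
```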
While this adds robustness in the null region, it may reduce power in some cases. We are considering making this the default behavior in a future release, but for now it is optional.
If LD scores (\(l_{j}\)) are available (e.g., from LDSC regression), they can be used as inverse weights (\(w_j = 1/l_{j}\)). This lets the estimators account for LD directly while using the full dataset, so pre-filtering for independent SNPs (LD pruning) is unnecessary.
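As a toy illustration of the weighting scheme (the LD scores here are made up, not from a real panel):

``` r
# Inverse LD-score weights: SNPs with large l_j (strong LD) are down-weighted
ldscores <- c(1, 4, 20)  # toy LD scores l_j
w <- 1 / ldscores
w
#> [1] 1.00 0.25 0.05
```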
Below we show how to incorporate weights into sfFDR:
# Create an artificial LD block
LD_block <- 500:1000
p[LD_block] <- runif(length(LD_block), min = 0.78, max = 0.82)
# Define weights (inverse LD)
w <- rep(1.0, length(p))
w[LD_block] <- 0.05
# Apply sfFDR with weights
mpi0 <- pi0_model(z, weights = w)
fpi0 <- fpi0est(p, mpi0, weights = w)
sffdr_out <- sffdr(p, fpi0 = fpi0$fpi0, weights = w)

Note on incorporating LD scores: LD scores are available for common reference panels (e.g., 1000 Genomes) and can easily be downloaded for use in sfFDR; LD scores computed from your own data are even better. However, using LD scores from a reference panel that is not matched to your study population is problematic. In that case, the pruning (or informed sampling) strategy discussed above is the best option.
The selection of knots in the spline basis is an important tuning parameter in sffdr. If your informative GWAS has extremely high power (very small p-values), we recommend placing more knots at the lower end of the p-value distribution to capture the local density of signals.
# Create model with custom knots (other variables, e.g., MAF,
# or more complicated models can also be specified)
fmod <- "~splines::ns(bfp, knots = c(0.01, 0.025, 0.05, 0.1))"
fpi0_mod <- fpi0est(p = p,
z = mpi0$zt,
pi0_model = fmod)

Note on incorporating additional variables: Other informative covariates can be included in the model (e.g., MAF, LD score). While LD score annotations can be used (e.g., the baselineLD model), we have found that using the expected \(\chi^{2}\) given the annotations performs better in practice and simplifies the model fitting.