This vignette demonstrates how to perform multi-trait colocalization analysis using summary statistics data, specifically focusing on the `Sumstat_5traits` dataset included in the package.

Sumstat_5traits Dataset

The `Sumstat_5traits` dataset contains summary statistics for 5 simulated traits, derived directly from the `Ind_5traits` dataset using marginal association. It is designed to evaluate and demonstrate the capabilities of ColocBoost in multi-trait colocalization analysis with summary association data.

- `sumstat`: A list of data.frames of summary statistics for different traits.
- `true_effect_variants`: True effect variant indices for each trait.

`LD` can be calculated from the `X` data in the `Ind_5traits` dataset, but it is not included in the `Sumstat_5traits` dataset.

The dataset features two causal variants with indices 194 and 589. This structure creates a realistic scenario in which multiple traits are influenced by different but overlapping sets of genetic variants.

# Loading the Dataset
library(colocboost)
data("Sumstat_5traits")
names(Sumstat_5traits)
#> [1] "sumstat" "true_effect_variants"
Sumstat_5traits$true_effect_variants
#> $Outcome_1
#> [1] 194
#>
#> $Outcome_2
#> [1] 194 589
#>
#> $Outcome_3
#> [1] 194 589
#>
#> $Outcome_4
#> [1] 194
#>
#> $Outcome_5
#> [1] 589
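
The sharing pattern in this output can be tabulated directly. The sketch below hard-codes the list printed above (rather than loading the package data) and inverts it to show which traits each causal variant affects. The names `long` and `traits_by_variant` are illustrative and not part of the package.

```r
# True effect variants copied from the output above (hard-coded so the
# example runs without the package data).
true_effect_variants <- list(
  Outcome_1 = 194,
  Outcome_2 = c(194, 589),
  Outcome_3 = c(194, 589),
  Outcome_4 = 194,
  Outcome_5 = 589
)

# Reshape to long form: one row per (variant, trait) pair.
long <- stack(true_effect_variants)  # columns: values (variant), ind (trait)

# Invert the list: for each variant, collect the traits it affects.
traits_by_variant <- split(as.character(long$ind), long$values)
traits_by_variant
```

Variant 194 is shared by `Outcome_1` through `Outcome_4`, and variant 589 by `Outcome_2`, `Outcome_3`, and `Outcome_5`.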

Due to the file size limits of the CRAN release, this is a subset of the simulated data. See the full dataset in the colocboost paper repo.

`sumstat` must include the following columns:

- `z` or (`beta`, `sebeta`): either the z-score or the effect size and its standard error.
- `n`: sample size for the summary statistics. Providing the sample size, or even a rough estimate of `n`, is highly recommended. Without `n`, the implicit assumption is that `n` is large (Inf) and the effect sizes are small (close to zero).
- `variant`: required if the `sumstat` for different outcomes do not have the same number of variables (multiple `sumstat` and multiple `LD`).

When studying multiple traits with their own trait-specific LD matrices, you can provide a list of LD matrices matched with a list of summary statistics.
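
As a minimal sketch of these inputs, the following builds per-trait `sumstat` data.frames with the columns listed above, together with a matching list of LD matrices. It uses toy simulated genotypes and base R only; the helper `marginal_sumstat` and the objects `sumstat_toy` and `LD_toy` are hypothetical names for illustration, and the block does not call `colocboost` itself.

```r
set.seed(1)

# Toy data (an assumption for illustration): 200 samples, 10 variants, 2 traits.
n <- 200
p <- 10
X <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("SNP", 1:p)))
Y <- cbind(X[, 3] * 0.5 + rnorm(n), X[, 3] * 0.4 + rnorm(n))

# Marginal (one variant at a time) least-squares estimates for one trait,
# equivalent to fitting lm(y ~ x) separately for each column of X.
marginal_sumstat <- function(X, y) {
  Xc <- scale(X, center = TRUE, scale = FALSE)
  yc <- y - mean(y)
  xtx <- colSums(Xc^2)
  beta <- drop(crossprod(Xc, yc)) / xtx
  resid_ss <- sapply(seq_len(ncol(Xc)), function(j) sum((yc - Xc[, j] * beta[j])^2))
  sebeta <- sqrt(resid_ss / (length(y) - 2) / xtx)
  data.frame(variant = colnames(X), beta = beta, sebeta = sebeta,
             z = beta / sebeta, n = length(y), row.names = NULL)
}

# One data.frame per trait and one LD matrix per trait, matched by position.
sumstat_toy <- lapply(seq_len(ncol(Y)), function(k) marginal_sumstat(X, Y[, k]))
LD_toy <- lapply(seq_len(ncol(Y)), function(k) cor(X))

str(sumstat_toy[[1]])
```

Here both traits reuse the same toy genotypes, so the two LD matrices are identical; with trait-specific reference panels each list element would be computed from its own genotype matrix.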
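
- `sumstat` and `LD` are organized as lists, matched by trait index.
- `(sumstat[1], LD[1])` contains information for trait 1.
- `(sumstat[2], LD[2])` contains information for trait 2.

# Compute LD from the genotype matrix in Ind_5traits (LD is not part of Sumstat_5traits)
data("Ind_5traits")
LD <- get_cormat(Ind_5traits$X[[1]])
# Duplicate LD with matched summary statistics
LD_multiple <- lapply(seq_along(Sumstat_5traits$sumstat), function(i) LD)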
# Run colocboost
res <- colocboost(sumstat = Sumstat_5traits$sumstat, LD = LD_multiple)
#> Validating input data.
#> Starting gradient boosting algorithm.
#> Gradient boosting for outcome 4 converged after 40 iterations!
#> Gradient boosting for outcome 5 converged after 59 iterations!
#> Gradient boosting for outcome 1 converged after 61 iterations!
#> Gradient boosting for outcome 3 converged after 91 iterations!
#> Gradient boosting for outcome 2 converged after 94 iterations!
#> Performing inference on colocalization events.
# Identified CoS
res$cos_details$cos$cos_index
#> $`cos1:y1_y2_y3_y4`
#> [1] 186 194 168 205
#>
#> $`cos2:y2_y3_y5`
#> [1] 589 593

When the LD matrix covers a superset of the variants found across the different summary statistics, use the following input format:

- `sumstat` is a list of data.frames for all traits.
- `LD` is a matrix of linkage disequilibrium (LD) information for all variants across all traits.

# Create sumstat with different numbers of variants - remove 20 randomly chosen variants from each sumstat
LD_superset <- LD
sumstat <- lapply(Sumstat_5traits$sumstat, function(x) x[-sample(seq_len(nrow(x)), 20), , drop = FALSE])
# Run colocboost
res <- colocboost(sumstat = sumstat, LD = LD_superset)
#> Validating input data.
#> Starting gradient boosting algorithm.
#> Gradient boosting for outcome 4 converged after 41 iterations!
#> Gradient boosting for outcome 5 converged after 60 iterations!
#> Gradient boosting for outcome 1 converged after 62 iterations!
#> Gradient boosting for outcome 3 converged after 93 iterations!
#> Gradient boosting for outcome 2 converged after 95 iterations!
#> Performing inference on colocalization events.
# Identified CoS
res$cos_details$cos$cos_index
#> $`cos1:y1_y2_y3_y4`
#> [1] 186 194 168 205
#>
#> $`cos2:y2_y3_y5`
#> [1] 589 593
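
Before running an analysis with a superset LD matrix, it can help to confirm that every variant in a `sumstat` is actually present in the LD matrix. The sketch below does this by variant name on toy data, assuming the `sumstat` has a `variant` column and the LD matrix has column names. All object names are illustrative, and this is a pre-check only, not a description of how `colocboost` aligns variants internally.

```r
set.seed(2)

# Toy superset LD (assumption for illustration): 8 variants named SNP1..SNP8.
p <- 8
G <- matrix(rnorm(100 * p), 100, p, dimnames = list(NULL, paste0("SNP", 1:p)))
LD_superset_toy <- cor(G)

# Toy summary statistics covering only a subset of those variants.
sumstat_toy <- data.frame(variant = c("SNP2", "SNP5", "SNP7"),
                          z = c(1.2, -0.4, 3.1), n = 100)

# Every variant in sumstat should appear in the LD matrix.
missing_in_LD <- setdiff(sumstat_toy$variant, colnames(LD_superset_toy))
stopifnot(length(missing_in_LD) == 0)

# Positions of the sumstat variants within the superset, and the
# corresponding sub-matrix of LD in the same order as sumstat.
idx <- match(sumstat_toy$variant, colnames(LD_superset_toy))
LD_sub <- LD_superset_toy[idx, idx, drop = FALSE]
dim(LD_sub)
```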

When different summary statistics need different LD matrices in an arbitrary arrangement, ColocBoost also provides an interface for multiple LD matrices with multiple `sumstat`. This particularly benefits meta-analysis across heterogeneous datasets where, for different subsets of summary statistics, LD comes from different populations.

- `sumstat = list(sumstat1, sumstat2, sumstat3, sumstat4, sumstat5)` is a list of data.frames for all traits.
- `LD = list(LD1, LD2)` is a list of LD matrices.
- `dict_sumstatLD` is a dictionary matrix that maps the index of each `sumstat` to the index of its `LD`.

# Create a simple dictionary for demonstration purposes
# Traits 1 and 2 are matched to the first LD matrix; traits 3, 4, 5 are matched to the second LD matrix.
LD_arbitrary <- list(LD, LD)
dict_sumstatLD <- cbind(1:5, c(1, 1, 2, 2, 2))
# Display the dictionary
dict_sumstatLD
#>      [,1] [,2]
#> [1,]    1    1
#> [2,]    2    1
#> [3,]    3    2
#> [4,]    4    2
#> [5,]    5    2
# Run colocboost
res <- colocboost(sumstat = Sumstat_5traits$sumstat, LD = LD_arbitrary, dict_sumstatLD = dict_sumstatLD)
#> Validating input data.
#> Starting gradient boosting algorithm.
#> Gradient boosting for outcome 4 converged after 40 iterations!
#> Gradient boosting for outcome 5 converged after 59 iterations!
#> Gradient boosting for outcome 1 converged after 61 iterations!
#> Gradient boosting for outcome 3 converged after 91 iterations!
#> Gradient boosting for outcome 2 converged after 94 iterations!
#> Performing inference on colocalization events.
# Identified CoS
res$cos_details$cos$cos_index
#> $`cos1:y1_y2_y3_y4`
#> [1] 186 194 168 205
#>
#> $`cos2:y2_y3_y5`
#> [1] 589 593
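
To make the role of the dictionary concrete, the sketch below reads the same two-column matrix as a lookup table: column 1 holds the `sumstat` index and column 2 the `LD` index. The helper `ld_for_trait` and the object `traits_by_LD` are hypothetical names used only for illustration.

```r
# Dictionary from the example above: column 1 = sumstat index, column 2 = LD index.
dict_sumstatLD <- cbind(1:5, c(1, 1, 2, 2, 2))

# Look up which LD matrix a given trait uses.
ld_for_trait <- function(dict, trait) dict[match(trait, dict[, 1]), 2]
ld_for_trait(dict_sumstatLD, 4)

# Group traits by the LD matrix they share.
traits_by_LD <- split(dict_sumstatLD[, 1], dict_sumstatLD[, 2])
traits_by_LD

# Sanity check: no LD index may exceed the number of LD matrices supplied
# (two in the example above).
n_LD <- 2
stopifnot(all(dict_sumstatLD[, 2] %in% seq_len(n_LD)))
```

Trait 4 maps to the second LD matrix; traits 1 and 2 share the first LD matrix, and traits 3, 4, and 5 share the second.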

ColocBoost also offers the flexibility to accept summary statistics in a HyPrColoc-compatible format, with or without an LD matrix.

# Loading the Dataset
data(Ind_5traits)
X <- Ind_5traits$X
Y <- Ind_5traits$Y
# Converting to HyPrColoc compatible format
effect_est <- effect_se <- effect_n <- c()
for (i in seq_along(X)) {
  x <- X[[i]]
  y <- Y[[i]]
  effect_n[i] <- length(y)
  output <- susieR::univariate_regression(X = x, y = y)
  effect_est <- cbind(effect_est, output$beta)
  effect_se <- cbind(effect_se, output$sebeta)
}
colnames(effect_est) <- colnames(effect_se) <- c("Y1", "Y2", "Y3", "Y4", "Y5")
rownames(effect_est) <- rownames(effect_se) <- colnames(X[[1]])
# Run colocboost
LD <- get_cormat(Ind_5traits$X[[1]])
res <- colocboost(effect_est = effect_est, effect_se = effect_se, effect_n = effect_n, LD = LD)
#> Validating input data.
#> Starting gradient boosting algorithm.
#> Gradient boosting for outcome 4 converged after 40 iterations!
#> Gradient boosting for outcome 5 converged after 59 iterations!
#> Gradient boosting for outcome 1 converged after 61 iterations!
#> Gradient boosting for outcome 3 converged after 91 iterations!
#> Gradient boosting for outcome 2 converged after 94 iterations!
#> Performing inference on colocalization events.
# Identified CoS
res$cos_details$cos$cos_index
#> $`cos1:y1_y2_y3_y4`
#> [1] 186 194 168 205
#>
#> $`cos2:y2_y3_y5`
#> [1] 589 593
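
The HyPrColoc-style matrices and the list-of-data.frames `sumstat` layout carry the same information. The sketch below shows the correspondence on toy values (all numbers and object names are assumptions for illustration): each column of `effect_est` and `effect_se` becomes one per-trait data.frame. It illustrates the relationship between the two layouts only and is not a description of any conversion `colocboost` performs internally.

```r
# Toy HyPrColoc-style inputs (assumption for illustration): rows are variants,
# columns are traits.
effect_est <- matrix(c(0.10, -0.02, 0.30,
                       0.05,  0.01, 0.25),
                     nrow = 3,
                     dimnames = list(c("SNP1", "SNP2", "SNP3"), c("Y1", "Y2")))
effect_se <- matrix(0.05, nrow = 3, ncol = 2, dimnames = dimnames(effect_est))
effect_n <- c(Y1 = 1000, Y2 = 800)

# Split the matrices column by column into one data.frame per trait.
sumstat_list <- lapply(colnames(effect_est), function(trait) {
  data.frame(variant = rownames(effect_est),
             beta = effect_est[, trait],
             sebeta = effect_se[, trait],
             n = effect_n[[trait]],
             row.names = NULL)
})
names(sumstat_list) <- colnames(effect_est)

# z-scores follow directly from effect size and standard error.
z_mat <- effect_est / effect_se

sumstat_list$Y2
```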

For more details on the data format for LD-free ColocBoost and on diagnosing LD mismatch, see "LD mismatch and LD-free Colocalization".