Bioinformatics Pipeline for ColocBoost

This vignette demonstrates how to use the bioinformatics pipeline for ColocBoost to perform colocalization analysis, from loading the regional data to running the analysis itself.

1. Loading Data using load_multitask_regional_data function

In this section, we introduce how to load the regional data required for the ColocBoost analysis using the load_multitask_regional_data function. This function harmonizes the input data and prepares it for colocalization analysis. It loads mixed datasets for a specific region: individual-level data (genotype, phenotype, and covariate data), summary statistics (sumstats and LD), or both. To do so, it runs load_regional_univariate_data and load_rss_data once for each dataset.

The input parameters for this function are illustrated in the two subsections below.

1.1. Loading individual-level data from multiple cohorts

Illustrated example

The following example demonstrates how to set up input data with 3 phenotypes and 2 cohorts, where the first cohort has 2 phenotypes and the second cohort has 1 phenotype.

# Example of loading individual-level data
region = "chr1:1000000-2000000"
genotype_list = c("plink_cohort1.1.bed", "plink_cohort1.2.bed")
phenotype_list = c("phenotype1_cohort1.bed.gz", "phenotype2_cohort1.bed.gz", "phenotype1_cohort2.bed.gz")
covariate_list = c("covariate1_cohort1.bed.gz", "covariate2_cohort1.bed.gz", "covariate1_cohort2.bed.gz")
conditions_list_individual = c("phenotype1_cohort1", "phenotype2_cohort1", "phenotype1_cohort2")
match_geno_pheno = c(1,1,2) # for each phenotype, the index of its matching genotype file in genotype_list
association_window = "chr1:1000000-2000000" # same as region for cis-analysis

# The following parameters need to be set according to your data
maf_cutoff = 0.01
mac_cutoff = 10
xvar_cutoff = 0
imiss_cutoff = 0.9

# For more advanced parameters, see pecotmr::load_multitask_regional_data()
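
Before loading, it can help to confirm that the per-phenotype vectors line up. The following is a minimal base R sketch and not part of pecotmr. It assumes, as the example above suggests, that there is one covariate file and one condition name per phenotype, and that each entry of match_geno_pheno is the position in genotype_list of the genotype file for the corresponding phenotype. The name pairing is illustrative only.

```r
# Inputs repeated from the example above so this block runs on its own
genotype_list <- c("plink_cohort1.1.bed", "plink_cohort1.2.bed")
phenotype_list <- c("phenotype1_cohort1.bed.gz", "phenotype2_cohort1.bed.gz", "phenotype1_cohort2.bed.gz")
covariate_list <- c("covariate1_cohort1.bed.gz", "covariate2_cohort1.bed.gz", "covariate1_cohort2.bed.gz")
conditions_list_individual <- c("phenotype1_cohort1", "phenotype2_cohort1", "phenotype1_cohort2")
match_geno_pheno <- c(1, 1, 2)

# Assumed convention: one covariate file, one condition name, and one
# genotype index per phenotype; every index must point into genotype_list
stopifnot(
  length(covariate_list) == length(phenotype_list),
  length(conditions_list_individual) == length(phenotype_list),
  length(match_geno_pheno) == length(phenotype_list),
  all(match_geno_pheno %in% seq_along(genotype_list))
)

# Table showing which genotype file each phenotype is paired with
pairing <- data.frame(
  condition = conditions_list_individual,
  phenotype = phenotype_list,
  covariate = covariate_list,
  genotype = genotype_list[match_geno_pheno],
  stringsAsFactors = FALSE
)
print(pairing)
```

With the example values, the two phenotypes of the first cohort are paired with the first genotype file and the phenotype of the second cohort with the second genotype file.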

1.2. Loading summary statistics from multiple cohorts or datasets

Illustrated example

The following example demonstrates how to set up input data with 2 summary statistics and one LD reference.

# Example of loading summary statistics
sumstat_path_list = c("sumstat1.tsv.gz", "sumstat2.tsv.gz")
column_file_path_list = c("mapping_columns_1.yml", "mapping_columns_2.yml")
LD_meta_file_path_list = "ld_meta_file.tsv"
conditions_list_sumstat = c("sumstat_1", "sumstat_2")

# The following parameters need to be set according to your data
n_samples = c(0, 0)
n_cases = c(10000, 20000)
n_controls = c(20000, 40000)

# For more advanced parameters, see pecotmr::load_multitask_regional_data()
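
A similar check can be applied to the summary statistics inputs. This is a minimal base R sketch under the assumption, taken from the example above, that each summary statistics file has exactly one column-mapping file, one condition name, and one entry in each of n_samples, n_cases, and n_controls. The name sumstat_overview is illustrative only.

```r
# Inputs repeated from the example above so this block runs on its own
sumstat_path_list <- c("sumstat1.tsv.gz", "sumstat2.tsv.gz")
column_file_path_list <- c("mapping_columns_1.yml", "mapping_columns_2.yml")
conditions_list_sumstat <- c("sumstat_1", "sumstat_2")
n_samples <- c(0, 0)
n_cases <- c(10000, 20000)
n_controls <- c(20000, 40000)

# Assumed convention: every per-dataset vector has one entry per sumstat file
n_sumstat <- length(sumstat_path_list)
stopifnot(
  length(column_file_path_list) == n_sumstat,
  length(conditions_list_sumstat) == n_sumstat,
  length(n_samples) == n_sumstat,
  length(n_cases) == n_sumstat,
  length(n_controls) == n_sumstat,
  all(c(n_samples, n_cases, n_controls) >= 0)
)

# One row per summary statistics dataset, for a quick visual check
sumstat_overview <- data.frame(
  condition = conditions_list_sumstat,
  sumstat = sumstat_path_list,
  column_file = column_file_path_list,
  n_samples = n_samples,
  n_cases = n_cases,
  n_controls = n_controls,
  stringsAsFactors = FALSE
)
print(sumstat_overview)
```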

2. Perform ColocBoost using colocboost_analysis_pipeline function

In this section, we perform the colocalization analysis using the colocboost_analysis_pipeline function, which takes the regional data loaded in Section 1 as its input:

# region_data <- load_multitask_regional_data(...)
# res <- colocboost_analysis_pipeline(region_data)
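
The two steps can be tied together as in the sketch below. It is a hedged illustration, not the definitive call: it assumes that the argument names of pecotmr::load_multitask_regional_data() match the variable names used in Section 1, which should be confirmed against the pecotmr documentation. The names args, input_files, and can_run are illustrative. The call is attempted only if pecotmr is installed and all example files exist, so the block runs safely without them.

```r
# Collect the inputs from Section 1 in one list.
# ASSUMPTION: the argument names equal the variable names used in this vignette.
args <- list(
  region = "chr1:1000000-2000000",
  genotype_list = c("plink_cohort1.1.bed", "plink_cohort1.2.bed"),
  phenotype_list = c("phenotype1_cohort1.bed.gz", "phenotype2_cohort1.bed.gz", "phenotype1_cohort2.bed.gz"),
  covariate_list = c("covariate1_cohort1.bed.gz", "covariate2_cohort1.bed.gz", "covariate1_cohort2.bed.gz"),
  conditions_list_individual = c("phenotype1_cohort1", "phenotype2_cohort1", "phenotype1_cohort2"),
  match_geno_pheno = c(1, 1, 2),
  association_window = "chr1:1000000-2000000",
  maf_cutoff = 0.01,
  mac_cutoff = 10,
  xvar_cutoff = 0,
  imiss_cutoff = 0.9,
  sumstat_path_list = c("sumstat1.tsv.gz", "sumstat2.tsv.gz"),
  column_file_path_list = c("mapping_columns_1.yml", "mapping_columns_2.yml"),
  LD_meta_file_path_list = "ld_meta_file.tsv",
  conditions_list_sumstat = c("sumstat_1", "sumstat_2"),
  n_samples = c(0, 0),
  n_cases = c(10000, 20000),
  n_controls = c(20000, 40000)
)

# Every file path listed in the inputs
input_files <- unlist(
  args[c("genotype_list", "phenotype_list", "covariate_list",
         "sumstat_path_list", "column_file_path_list", "LD_meta_file_path_list")],
  use.names = FALSE
)

# Only attempt the call when pecotmr is installed and the files are present
can_run <- requireNamespace("pecotmr", quietly = TRUE) && all(file.exists(input_files))

if (can_run) {
  region_data <- do.call(pecotmr::load_multitask_regional_data, args)
  # Next step, as in the commented lines above:
  # res <- colocboost_analysis_pipeline(region_data)
} else {
  message("Skipping the call: pecotmr or the example input files are not available.")
}
```

Building the argument list first keeps the individual-level and summary statistics inputs in one place, which makes it easier to drop either group when only one data type is available.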