--- title: "Input Data Format" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Input Data Format} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ```{r setup} library(colocboost) ``` This vignette documents the standard input data formats of `colocboost`. # 1. Individual Level Data For analyses using individual-level data, the basic format for single trait is as follows: - `X` is an $N \times P$ matrix with $N$ individuals and $P$ variants. Including variant names as column names is highly recommended, especially when working with multiple $X$ matrices and $Y$ vectors. - `Y` is a length $N$ vector containing phenotype values for the same $N$ individuals as $X$. The input format for multiple traits is similar, but `X` should be a list of genotype matrices, each corresponding to a different trait. `Y` should also be a list of phenotype vectors. For example: - `X = list(X1, X2, X3, X4, X5)` where each `Xi` is a matrix for trait `i` - with the dimension of $N_i \times P_i$, where $N_i$ and $P_i$ do not need to be the same for different traits. - `Y = list(Y1, Y2, Y3, Y4, Y5)` where each `Yi` is a vector for trait `i` - with $N_i$ individuals. `colocboost` also offers flexible input options (see detailed usage with different input formats, refer to [Individual Level Data Colocalization](https://statfungen.github.io/colocboost/articles/Individual_Level_Colocalization.html)): - Single $X$ matrix with $N \times P$, and $Y$ matrix with $N \times L$ for $L$ traits. - Multiple $X$ matrices and unmatched $Y$ vectors with a mapping dictionary (example shown in section 3 below). # 2. Summary Statistics For analyses using summary statistics, the basic format for single trait is as follows: - `sumstat` is a data frame with required columns `z` or (`beta`, `sebeta`), and optional columns but highly recommended `n` and `variant`. ```{r summary-stats-example} data(Sumstat_5traits) head(Sumstat_5traits$sumstat[[1]]) ``` - `z` or (`beta`, `sebeta`) - required: either z-score or (effect size and standard error) - `n` - highly recommended: sample size for the summary statistics, it is highly recommendation to provide. - `variant` - highly recommended: required if sumstat for different outcomes do not have the same number of variables (multiple sumstat and multiple LD). - `LD` is a matrix of LD. This matrix does not need to contain the exact same variants as in `sumstat`, but the `colnames` and `rownames` of `LD` should include the `variant` names for proper alignment. The input format for multiple traits is similar, but `sumstat` should be a list of data frames `sumstat = list(sumstat1, sumstat2, sumstat3)`. The flexibility of input format for multiple traits is as follows (see detailed usage with different input formats, refer to [Summary Statistics Colocalization](https://statfungen.github.io/colocboost/articles/Summary_Level_Colocalization.html)): - One LD matrix with a superset of variants in `sumstat` for all traits is allowed. - Multiple LD matrices, each corresponding to a different trait, are also allowed for the trait-specific LD structure. - Multiple LD matrices and unmatched `sumstat` data frames with a mapping dictionary are also allowed (example shown in section 3 below). # 3. Optional: mapping between arbitrary input $X$ and $Y$ For analysis when including multiple genotype matrices `X` with unmatched arbitrary phenotype vectors `Y`, a mapping dictionary `dict_YX` is required to indicate the relationship between `X` and `Y`. Similarly, when multiple LD matrices with unmatched arbitrary multiple summary statistics `sumstat` are used, a mapping dictionary `dict_sumstatLD` is required to indicate the relationship between `sumstat` and `LD`. For example, considering three genotype matrices `X = list(X1, X2, X3)` and 6 phenotype vectors `Y = list(Y1, Y2, Y3, Y4, Y5, Y6)`, where - `X1` is for trait 1, trait 2, trait 3 - `X2` is for trait 4, trait 5 - `X3` is for trait 6 Then, you need to define a 6 by 2 matrix mapping dictionary `dict_YX` as follows: - The first column should be `c(1,2,3,4,5,6)` for 6 traits. - The second column should be `c(1,1,1,2,2,3)` for 3 genotype matrices. Here, each row indicates the trait index and the corresponding genotype matrix index. ```{r dict_YX} dict_YX <- cbind(c(1,2,3,4,5,6), c(1,1,1,2,2,3)) dict_YX ``` # 4. HyPrColoc compatible format: effect size and standard error matrices ColocBoost also provides a flexibility to use HyPrColoc compatible format for summary statistics with and without LD matrix. For example, when analyze $L$ traits for the same $P$ variants with the specified effect size and standard error matrices: - `effect_est` (required) is $P \times L$ matrix of variable regression coefficients (i.e. regression beta values) in the genomic region. - `effect_se` (required) is $P \times L$ matrix of standard errors for the regression coefficients. - `effect_n` (highly recommended) is either a scalar or a vector of sample sizes for estimating regression coefficients. - `LD` (optional) is LD matrix for the $P$ variants. If it is not provided, it will apply LD-free ColocBoost. See more details about HyPrColoc compatible format in [Summary Statistics Colocalization](https://statfungen.github.io/colocboost/articles/Summary_Level_Colocalization.html)). See more details about data format to implement LD-free ColocBoost and LD-mismatch diagnosis in [LD mismatch and LD-free Colocalization](https://statfungen.github.io/colocboost/articles/LD_Free_Colocalization.html)).