Input Data Format

library(colocboost)

This vignette documents the standard input data formats of colocboost.

1. Individual Level Data

For analyses using individual-level data, the basic input for a single trait is a genotype matrix X and a phenotype vector Y.
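
As a minimal sketch of the single-trait layout, the following simulates a genotype matrix and a phenotype vector in base R. The dimensions, the simulated values, and the `rs_` column names are illustrative assumptions, not data shipped with the package.

```r
set.seed(1)

N <- 100  # number of individuals (hypothetical)
P <- 20   # number of variants (hypothetical)

# Genotype matrix: one row per individual, one column per variant,
# entries are simulated allele counts in {0, 1, 2}.
X <- matrix(rbinom(N * P, size = 2, prob = 0.3), nrow = N, ncol = P)
colnames(X) <- paste0("rs_", seq_len(P))

# Phenotype vector: one value per individual, in the same row order as X.
Y <- as.numeric(0.5 * X[, 5] + rnorm(N))

dim(X)
length(Y)
```

The point of the sketch is only that the rows of X and the entries of Y refer to the same individuals in the same order.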

The input format for multiple traits is similar, but X should be a list of genotype matrices, each corresponding to a different trait, and Y should likewise be a list of phenotype vectors, for example X = list(X1, X2, X3) and Y = list(Y1, Y2, Y3).
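
A hedged sketch of the multi-trait layout, using simulated data in base R. The helper `sim_genotype()` and the sample sizes are hypothetical and exist only for this illustration.

```r
set.seed(2)

P <- 20  # number of variants (hypothetical)

# Hypothetical helper: simulate an n x p genotype matrix with named columns.
sim_genotype <- function(n, p) {
  G <- matrix(rbinom(n * p, size = 2, prob = 0.3), nrow = n, ncol = p)
  colnames(G) <- paste0("rs_", seq_len(p))
  G
}

# One genotype matrix per trait; sample sizes may differ across traits.
X1 <- sim_genotype(80, P)
X2 <- sim_genotype(100, P)
X3 <- sim_genotype(120, P)

# One phenotype vector per trait, matching the rows of its genotype matrix.
Y1 <- rnorm(nrow(X1))
Y2 <- rnorm(nrow(X2))
Y3 <- rnorm(nrow(X3))

X <- list(X1, X2, X3)
Y <- list(Y1, Y2, Y3)

# Sanity check: the i-th phenotype has one entry per row of the i-th matrix.
aligned <- mapply(function(x, y) nrow(x) == length(y), X, Y)
aligned
```

In this layout the i-th element of Y is paired with the i-th element of X, so both lists have the same length.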

colocboost also offers flexible input options; for detailed usage with different input formats, see Individual Level Data Colocalization.

2. Summary Statistics

For analyses using summary statistics, the basic format for a single trait is as follows:

data(Sumstat_5traits)
head(Sumstat_5traits$sumstat[[1]])
#>              z    n variant
#> 451 -1.0945531 1153    rs_1
#> 452 -0.4113347 1153    rs_2
#> 453 -0.4113347 1153    rs_3
#> 454 -0.7467923 1153    rs_4
#> 455 -0.3018575 1153    rs_5
#> 456 -0.5256479 1153    rs_6

- `z` or (`beta`, `sebeta`) - required: either the z-score, or the effect size together with its standard error.
- `n` - highly recommended: sample size for the summary statistics.
- `variant` - highly recommended: variant identifier; required if the sumstat for different outcomes do not have the same number of variants (multiple sumstat and multiple LD).
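
When only effect sizes and standard errors are available, the z-score is their ratio. The sketch below builds a small summary statistics data frame with the columns listed above; the numeric values are made up for illustration.

```r
# Hypothetical summary statistics for three variants.
sumstat <- data.frame(
  beta    = c(0.12, -0.05, 0.30),  # effect sizes (made-up values)
  sebeta  = c(0.04,  0.05, 0.10),  # standard errors (made-up values)
  n       = 1153,                  # sample size, recycled for every row
  variant = paste0("rs_", 1:3),
  stringsAsFactors = FALSE
)

# z-score = effect size / standard error.
sumstat$z <- sumstat$beta / sumstat$sebeta
sumstat
```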

The input format for multiple traits is similar, but sumstat should be a list of data frames, e.g. sumstat = list(sumstat1, sumstat2, sumstat3). Multiple traits can be supplied in several flexible ways; for detailed usage with different input formats, see Summary Statistics Colocalization.
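
The following sketch assembles such a list from simulated data frames that cover different sets of variants, which is the situation where the `variant` column matters. The helper `make_sumstat()` and all values are hypothetical.

```r
set.seed(3)

# Hypothetical helper: build one summary statistics data frame.
make_sumstat <- function(variants, n) {
  data.frame(
    z       = rnorm(length(variants)),  # simulated z-scores
    n       = n,
    variant = variants,
    stringsAsFactors = FALSE
  )
}

sumstat1 <- make_sumstat(paste0("rs_", 1:6), 1153)
sumstat2 <- make_sumstat(paste0("rs_", 3:10), 980)
sumstat3 <- make_sumstat(paste0("rs_", 1:10), 1500)

sumstat <- list(sumstat1, sumstat2, sumstat3)

# Variants present in every data frame, identified by name rather than position.
shared <- Reduce(intersect, lapply(sumstat, function(s) s$variant))
shared
```

Because the three data frames have 6, 8, and 10 rows, row position alone cannot align them; the variant names can.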

3. Optional: mapping between arbitrary input \(X\) and \(Y\)

When an analysis includes multiple genotype matrices X and phenotype vectors Y that are not matched one-to-one, a mapping dictionary dict_YX is required to indicate which X corresponds to each Y. Similarly, when multiple LD matrices are used with summary statistics sumstat that are not matched one-to-one, a mapping dictionary dict_sumstatLD is required to indicate which LD matrix corresponds to each sumstat.

For example, consider three genotype matrices X = list(X1, X2, X3) and six phenotype vectors Y = list(Y1, Y2, Y3, Y4, Y5, Y6), where Y1, Y2, and Y3 are paired with X1, Y4 and Y5 with X2, and Y6 with X3.

Then you need to define the mapping dictionary dict_YX as a 6 by 2 matrix, in which each row gives a trait index (first column) and the index of the corresponding genotype matrix (second column):

dict_YX <- cbind(c(1,2,3,4,5,6), c(1,1,1,2,2,3))
dict_YX
#>      [,1] [,2]
#> [1,]    1    1
#> [2,]    2    1
#> [3,]    3    1
#> [4,]    4    2
#> [5,]    5    2
#> [6,]    6    3
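
A mapping matrix of this shape can be inspected with base R, and a dictionary for summary statistics can be built the same way. In the sketch below, the four-sumstat, two-LD scenario is hypothetical, and the column order of `dict_sumstatLD` (sumstat index first, LD index second) is assumed by analogy with `dict_YX`.

```r
dict_YX <- cbind(c(1, 2, 3, 4, 5, 6), c(1, 1, 1, 2, 2, 3))

# Group trait indices (column 1) by genotype matrix index (column 2).
traits_by_X <- split(dict_YX[, 1], dict_YX[, 2])
traits_by_X

# Hypothetical scenario: four summary statistics sharing two LD matrices.
# Assumed layout: column 1 = sumstat index, column 2 = LD matrix index.
dict_sumstatLD <- cbind(c(1, 2, 3, 4), c(1, 1, 2, 2))

# Basic checks: each trait appears once and each index points to a real element.
n_sumstat <- 4
n_LD <- 2
valid <- !any(duplicated(dict_sumstatLD[, 1])) &&
  all(dict_sumstatLD[, 1] %in% seq_len(n_sumstat)) &&
  all(dict_sumstatLD[, 2] %in% seq_len(n_LD))
valid
```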

4. HyPrColoc compatible format: effect size and standard error matrices

ColocBoost also accepts the HyPrColoc-compatible format for summary statistics, with or without an LD matrix. For example, when analyzing \(L\) traits for the same \(P\) variants, the effect sizes and standard errors are supplied as matrices.
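
As a rough sketch of this layout, the code below simulates effect size and standard error matrices for \(L\) traits and \(P\) variants. The object names `effect_est` and `effect_se`, the trait labels, and the orientation (variants in rows, traits in columns) are assumptions made for this illustration; check the linked vignette for the exact arguments.

```r
set.seed(4)

P <- 10  # number of variants (hypothetical)
L <- 3   # number of traits (hypothetical)

variants <- paste0("rs_", seq_len(P))
traits   <- paste0("trait_", seq_len(L))

# Assumed orientation: P x L, one row per variant and one column per trait.
effect_est <- matrix(rnorm(P * L, sd = 0.1), nrow = P, ncol = L,
                     dimnames = list(variants, traits))
effect_se  <- matrix(runif(P * L, min = 0.02, max = 0.05), nrow = P, ncol = L,
                     dimnames = list(variants, traits))

# Element-wise ratio gives the z-score of each variant for each trait.
z <- effect_est / effect_se
dim(z)
```

Naming the rows and columns keeps variants and traits identifiable once the matrices are passed on.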

For more details about the HyPrColoc-compatible format, see Summary Statistics Colocalization.

For more details about the data format used to run LD-free ColocBoost and LD-mismatch diagnosis, see LD mismatch and LD-free Colocalization.