---
title: "Bioinformatics Pipeline for ColocBoost"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Bioinformatics Pipeline for ColocBoost}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```


This vignette demonstrates how to use the bioinformatics pipeline for ColocBoost to perform colocalization analysis with `colocboost`. 

- See more details about functions in the package `pecotmr` with [link](https://github.com/StatFunGen/pecotmr/tree/main) and 
`colocboost_pipeline` with [link](https://github.com/StatFunGen/pecotmr/blob/main/R/colocboost_pipeline.R). 
- See more details about input data preparation in `xqtl_protocol` with [link](https://statfungen.github.io/xqtl-protocol/code/mnm_analysis/mnm_methods/colocboost.html).


# 1. Loading Data using `colocboost_analysis_pipeline` function


This function harmonizes the input data and prepares it for colocalization analysis. 
In this section, we introduce how to load the regional data required for the ColocBoost analysis using the `load_multitask_regional_data` function. 
This function loads mixed datasets for a specific region, including individual-level data (genotype, phenotype, covariate data)
or summary statistics (sumstats, LD). Run `load_regional_univariate_data` and `load_rss_data` multiple times for different datasets.


Below are the input parameters for this function:

## 1.1. Loading individual-level data from multiple cohorts


- **`region`** (required): A string of `chr:start-end` for the phenotype region you want to analyze. 
- **`genotype_list`**: A vector of paths for PLINK bed files containing genotype data. 
- **`phenotype_list`**: A vector of paths for phenotype file names.
- **`covariate_list`**: A vector of paths for covariate file names corresponding to the phenotype file vector.
- **`conditions_list_individual`**: A vector of strings representing different conditions or groups.
- **`match_geno_pheno`**: A vector of indices of phenotypes matched to genotype if multiple genotype PLINK files are used.
- **`maf_cutoff`**: Minimum minor allele frequency (MAF) cutoff. Default is 0.
- **`mac_cutoff`**: Minimum minor allele count (MAC) cutoff. Default is 0.
- **`xvar_cutoff`**: Minimum variance cutoff. Default is 0.
- **`imiss_cutoff`**: Maximum individual missingness cutoff. Default is 0.
- **`association_window`**: A string of `chr:start-end` for the association analysis window (cis or trans). If not provided, all genotype data will be loaded.
- **`extract_region_name`**: A list of vectors of strings (e.g., gene ID `ENSG00000269699`) to subset the information when there are multiple regions available. Default is `NULL`.
- **`region_name_col`**: Column name containing the region name. Default is `NULL`.
- **`keep_indel`**: Logical indicating whether to keep insertions/deletions (INDELs). Default is `TRUE`.
- **`keep_samples`**: A vector of sample names to keep. Default is `NULL`.

**Illustrated example**

The following example demonstrates how to set up input data with 3 phenotypes and 2 cohorts, where the first cohort has 2 phenotypes and the second cohort has 1 phenotype.

```{r, data-loader-individual}
# Example of loading individual-level data
region = "chr1:1000000-2000000"
genotype_list = c("plink_cohort1.1.bed", "plink_cohort1.2.bed")
phenotype_list = c("phenotype1_cohort1.bed.gz", "phenotype2_cohort1.bed.gz", "phenotype1_cohort2.bed.gz")
covariate_list = c("covariate1_cohort1.bed.gz", "covariate2_cohort1.bed.gz", "covariate1_cohort2.bed.gz")
conditions_list_individual = c("phenotype1_cohort1", "phenotype2_cohort1", "phenotype1_cohort2")
match_geno_pheno = c(1,1,2) # indices of phenotypes matched to genotype
association_window = "chr1:1000000-2000000" # same as region for cis-analysis

# Following parameters need to be set according to your data
maf_cutoff = 0.01
mac_cutoff = 10
xvar_cutoff = 0
imiss_cutoff = 0.9

# More advanced parameters see pecotmr::load_multitask_regional_data()
```


## 1.2. Loading summary statistics from multiple cohorts or datasets

- **`sumstat_path_list`**: A vector of file paths to the summary statistics.
- **`column_file_path_list`**: A vector of file paths to the column mapping files.
- **`LD_meta_file_path_list`**: A vector of paths to LD metadata files. LD metadata is a data frame specifying LD blocks with columns "chrom", "start", "end", and "path". 
"start" and "end" denote the positions of LD blocks, and "path" is the path of each LD block, optionally including `bim` file paths.
- **`match_LD_sumstat`**: Logical indicating whether to match LD blocks with summary statistics.
- **`conditions_list_sumstat`**: A vector of strings representing different sumstats.
- **`n_samples`**: User-specified sample size. If unknown, set as 0 to retrieve from the sumstat file.
- **`n_cases`**: User-specified number of cases.
- **`n_controls`**: User-specified number of controls.
- **`region`**: The region where tabix is used to subset the input dataset.
- **`extract_sumstats_region_name`**: User-specified gene/phenotype name used to further subset the phenotype data.
- **`sumstats_region_name_col`**: Filter this specific column for the `extract_sumstats_region_name`.
- **`comment_string`**: Comment sign in the column mapping file, default is `#`.
- **`extract_coordinates`**: Optional data frame with columns "chrom" and "pos" for specific coordinates extraction.

**Illustrated example**

The following example demonstrates how to set up input data with 2 summary statistics and one LD reference.

```{r, data-loader-sumstat}
# Example of loading summary statistics
sumstat_path_list = c("sumstat1.tsv.gz", "sumstat2.tsv.gz")
column_file_path_list = c("mapping_columns_1.yml", "mapping_columns_2.yml")
LD_meta_file_path_list = "ld_meta_file.tsv"
covariate_list = c("covariate1_cohort1.bed.gz", "covariate2_cohort1.bed.gz", "covariate1_cohort2.bed.gz")
conditions_list_sumstat = c("sumstat_1", "sumstat_2")

# Following parameters need to be set according to your data
n_samples = c(0, 0)
n_cases = c(10000, 20000)
n_controls = c(20000, 40000)

# More advanced parameters see pecotmr::load_multitask_regional_data()
```


# 2. Perform ColocBoost using `colocboost_analysis_pipeline` function

In this section, we perform the colocalization analysis using the `colocboost_analysis_pipeline` function. Below are the input parameters for this function:


- **`region_data`**: The output of the `load_multitask_regional_data` function.
- **`focal_trait`**: Name of the trait if performing disease-prioritized ColocBoost.
- **`event_filters`**: A list of patterns for filtering events based on context names. 
Example: for sQTL, `list(type_pattern = ".*clu_(\\d+_[+-?]).*", valid_pattern = "clu_(\\d+_[+-?]):PR:", exclude_pattern = "clu_(\\d+_[+-?]):IN:")`.
- **`maf_cutoff`**: A scalar to remove variants with maf < maf_cutoff, default is 0.005.
- **`pip_cutoff_to_skip_ind`**: A vector of cutoff values for skipping analysis based on PIP values for each context. Default is 0.
- **`pip_cutoff_to_skip_sumstat`**: A vector of cutoff values for skipping analysis based on PIP values for each sumstat. Default is 0.
- **`qc_method`**: Quality control method to use. Options are "rss_qc", "dentist", or "slalom" (default: "rss_qc").
- **`impute`**: Logical; if TRUE, performs imputation for outliers identified in the analysis (default: TRUE).
- **`impute_opts`**: A list of imputation options including rcond, R2_threshold, and minimum_ld (default: `list(rcond = 0.01, R2_threshold = 0.6, minimum_ld = 5)`).


```{r, colocboost-analysis}
# region_data <- load_multitask_regional_data(...)
# res <- colocboost_analysis_pipeline(region_data)
```