--- title: "How To Create a Mask File" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{How To Create a Mask File} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ```{r setup} library(smer) ``` The Sparse Marginal Epistasis test (SME) requires a HDF5 file containing binary masking data. The effectiveness of the test relies on how informative the mask is for a specific trait. For instance, to test for marginal epistasis related to transcriptional activity, DNase I hypersensitivity sites data can be used to condition the detection on accessible chromatin regions. Sparse Marginal Epistasis test (SME) schematic DHS data **Figure 1.** The illustration shows open chromatin data. DNase I hypersensitive sites indicate transcriptional activity. The DNase-seq signal is converted to a binary mask, where genetic variants in inaccessible chromatin regions are excluded from the covariance matrix for marginal epistasis. ## Mask File Format The `sme()` function expects the mask data to be in an HDF5 file. The HDF5 format includes two primary object types: - Datasets - typed multidimensional arrays - Groups - container structures that can hold datasets and other groups ## Mask Format Requirements The mask data should be organized into the following groups and datasets: **Groups**: - `ld`: Contains SNPs in linkage disequilibrium (LD) with the focal SNP, which will be **excluded**. - `gxg`: Contains indices of SNPs used to condition the marginal epistasis test, which will be **included**. The required group names can be configured as input parameters of `sme()`. The defaults are `ld` and `gxg`. **Datasets**: - `ld/`: For each focal SNP ``, this dataset contains indices of SNPs in the same LD block as that SNP. These SNPs will be **excluded** from the gene-by-gene interaction covariance matrix. - `gxg/`: For each focal SNP ``, this dataset contains indices of SNPs to **include** in the the gene-by-gene interaction covariance matrix for focal SNP ``. **Important**: All indices in the mask file must be **zero-based** and should correspond to the zero-based row indices of the PLINK `.bim` file. This includes the dataset index (`` in `gxg/`) and the data itself. This zero-based indexing is necessary because the mask data is read by a C++ subroutine in `sme()`, which uses zero-based indexing, unlike R's one-based indexing for SNP indices in the function call. ## Creating and Using Mask Files The package provides utility functions to create, write, and read valid mask files for `sme()`. ```{r} hdf5_file <- tempfile() # Group names gxg_h5_group <- "gxg" ld_h5_group <- "ld" # Data (still in 1-based R indexing) include_gxg_snps <- 1:10 exclude_ld_snps <- 5:6 # Focal SNP (still in 1-based R indexing) focal_snp <- 4 # Dataset names dataset_name_pattern <- "%s/%s" # 0-based index! gxg_dataset <- sprintf(dataset_name_pattern, gxg_h5_group, focal_snp - 1) ld_dataset <- sprintf(dataset_name_pattern, ld_h5_group, focal_snp - 1) # Create an empty HDF5 file create_hdf5_file(hdf5_file) # Write LD data write_hdf5_dataset(hdf5_file, ld_dataset, exclude_ld_snps - 1) # 0-based index! # Write GXG data write_hdf5_dataset(hdf5_file, gxg_dataset, include_gxg_snps - 1) ``` We can check that the data was written correctly. ```{r} ld_read <- read_hdf5_dataset(hdf5_file, ld_dataset) gxg_read <- read_hdf5_dataset(hdf5_file, gxg_dataset) print(sprintf("Zero-based indices of SNPs to exclude: %s", str(ld_read))) print(sprintf("Zero-based indices of SNPs to include: %s", str(gxg_read))) ``` ## SessionInfo ```{r seesionInfo} sessionInfo() ```