--- title: "How To Use the Sparse Marginal Epistasis Test" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{How To Use the Sparse Marginal Epistasis Test} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", message = FALSE, warning = FALSE ) ``` The sparse marginal epistasis (SME) test performs a genome-wide search for SNPs involved in genetic interactions while conditioning on information derived from functional genomic data. By examining one SNP at a time, SME fits a linear mixed model to test for marginal epistasis. It explicitly models the combined additive effects from all SNPs a marginal epistatic effect that represents pairwise interactions involving a test SNP. The key to the SME formulation is that the interaction between the test SNP and other SNPs may be masked depending on additional information. Masking interactions that do not contribute to trait variance both maximizes the power of the inference as well as realizes the computational efficiency needed to analyze human Biobank scale data. This pages presents a toy example to show case what insights the implementation of the test provides. Sparse Marginal Epistasis test (SME) schematic manhattan plot **Figure 1. ** SME performs a genome-wide search for SNPs involved in genetic interactions while conditioning on information derived from functional genomic data. SME has improved power to detect marginal epistasis and runs 10x to 90x faster than state-of-the-art methods. ## Run SME with PLINK data ```{r setup} library(smer) library(dplyr) library(ggplot2) ``` ### Data Requirements and File Formats SME is implemented in R and requires your genetic data to be in PLINK format, which consists of three files: - `.bim`: Contains SNP information - `.bed`: Contains genetic data - `.fam`: Contains sample information **Note:** `sme()` cannot handle missing genotypes. Prior to running `sme()`, use PLINK to remove variants with missing genotypes or impute them. Additionally, your phenotype (trait) data should be in a separate file following PLINK's phenotype format. The `sme()` function includes parameters that let you control memory usage and computational resources. For detailed guidance on optimizing these settings for your system, please see our tutorial [How To Optimize the Memory Requirements of SME](https://lcrawlab.github.io/sme/articles/tutorial-memory-optimization.html). ### Specifying SNPs for Analysis When selecting which SNPs to analyze, you must provide their positions as **1-based indices**. These indices correspond to the row numbers in your `.bim` file, where the first SNP is index 1, the second is index 2, and so on. For complete details about all function parameters, please refer to the documentation of the `sme()` function. ```{r run_sme, eval = FALSE} # File inputs plink_file <- "path/to/plink/file" pheno_file <- "path/to/pheno/file" mask_file <- "path/to/mask/file" # Parameter inputs chun_ksize <- 10 n_randvecs <- 10 n_blocks <- 10 n_threads <- 5 # 1-based Indices of SNPs to be analyzed n_snps <- 100 snp_indices <- 1:n_snps sme_result <- sme( plink_file, pheno_file, mask_file, snp_indices, chunk_size, n_randvecs, n_blocks, n_threads ) ``` ```{r assign_data, include = FALSE} sme_result <- getting_started ``` This analysis uses simulated data for demonstration purposes. We simulated synthetic phenotypes from 5000 synthetic genotypes with 6000 SNPs. If you would like to learn how to simulate data, please refer to our tutorial [How To Simulate Traits](https://lcrawlab.github.io/sme/articles/tutorial-simulate-traits.html). ### Understanding the Results When you call the `sme()` function, it returns a list containing multiple elements. The most important one is called `summary`, which contains the main analysis results. These results are formatted as tidy data, making them compatible with popular R packages like `ggplot2` and `dplyr` for further analysis and visualization. ### Visualizing Genomic Associations We use Manhattan plots to visualize genome-wide analyses because they effectively highlight strong associations between genetic variants and traits. In this case, the Manhattan plot specifically shows statistical epistasis (interactions between genes). For reference, we've marked the true causal SNPs (Single Nucleotide Polymorphisms) in green on the plot - these are the genetic variants we included in our simulation as having real effects. ```{r manhattan_plot, fig.alt="Manhattan plot to illustrate the sparse marginal epistasis test"} sme_result$summary %>% ggplot(aes( x = index, y = -log10(p), color = true_gxg_snp )) + geom_point() + xlab("Position") + labs(color = "Epistatic SNP") ``` ### Understanding Variance Components and Effect Sizes SME estimates how much of the total trait variation can be explained by genetic interactions. In this simulation, we set the Phenotypic Variance Explained (PVE) to 5% for each SNP involved in epistatic interactions. The plot below shows how well our method recovered these effects. It displays two distributions: 1. The estimated PVE for SNPs we know are truly involved in epistatic interactions 2. The estimated PVE for SNPs that have no real epistatic effects The dashed line marks the true 5% PVE level we used in the simulation, allowing you to see how accurately the method estimated the actual effect sizes. ```{r pve_plot, fig.alt="PVE plot to illustrate the sparse marginal epistasis test"} sme_result$summary %>% ggplot(aes(x = true_gxg_snp, y = pve, fill = true_gxg_snp)) + geom_boxplot() + geom_hline(yintercept = 0.05, color = "grey40", linetype = "dashed") + annotate("text", x = 0.8, y = 0.055, label = "True per SNP epistatic PVE", color = "black") + xlab("Epistatic SNP") + ylab("Phenotypic Variance Explained") + theme(legend.position = "none") ``` ### Narrow Sense Heritability Estimates The SME method uses a linear mixed model to separate different sources of trait variation. One key component it estimates is narrow sense heritability ($h^2$), which measures how much trait variation can be explained by additive genetic effects. The plot below breaks down the estimated sources of trait variation: - "grm": Shows narrow sense heritability ($h^2$) - "gxg": Shows variance due to gene-by-gene interactions - "error": Shows unexplained variance, such as environmental effects In our simulation, we set the true narrow sense heritability to 30%, shown as a dashed line in the plot. This reference line helps you evaluate how accurately SME estimated the genetic components of trait variation. ```{r h2_plot, fig.alt="h2 plot to illustrate the sparse marginal epistasis test"} sme_result$vc_estimate %>% ggplot(aes(x = component, y = vc_estimate, fill = component)) + geom_boxplot() + geom_hline(yintercept = 0.3, color = "grey40", linetype = "dashed") + annotate("text", x = 0.7, y = 0.33, label = expression("True " * h^2), color = "black") + xlab("Component") + ylab("Variance Component Estimate") + theme(legend.position = "none") ``` The estimate of the narrow-sense heritability $h^2$ is much less variable because it is always informed by the same genetic relatedness matrix. In this small data example it overestimates the heritability but it is unbiased in general. ## SessionInfo ```{r seesionInfo} sessionInfo() ```