--- title: "Using SNPknock with Genetic Data" author: "Matteo Sesia (msesia@stanford.edu)" date: "`r Sys.Date()`" output: rmarkdown::html_vignette bibliography: ../inst/REFERENCES.bib vignette: > %\VignetteIndexEntry{Using SNPknock with Genetic Data} %\VignetteEngine{knitr::rmarkdown} %\usepackage[utf8]{inputenc} --- This vignette illustrates the basic usage of the `SNPknock` package in combination with the phasing software `fastphase` [@scheet2006] to create knockoffs of unphased genotypes or phased haplotypes [@sesia2019, @sesia2019multi]. Since `fastphase` is not available as an R package, this functionality of `SNPknock` requires the user to first obtain a copy of `fastphase`. To learn more about the application of `SNPknock` to large genome-wide association studies [@sesia2019multi], visit: . ## Obtaining `fastphase` `fastphase` is a phasing and imputation tool based on the hidden Markov model described in [@scheet2006]. Binary executables for Linux and Mac OS are available from . Before continuing with this tutorial, download the `fastphase` tarball from the above link and extract the `fastphase` executable file into a convenient directory (e.g. "~/bin/"). ## Knockoffs for unphased genotypes ### Fitting the hidden Markov model on genotype data A small synthetic dataset of 1454 unphased genotype SNPs from 100 individuals can be found in the package installation directory. We can load it with: ```{r} library(SNPknock) X_file = system.file("extdata", "genotypes.RData", package = "SNPknock") load(X_file) table(X) ``` Below, we show how to fit a hidden Markov model to this data, with the help of `fastphase`. Since `fastphase` takes as input genotype sequences in ".inp" format, we must first convert the X matrix by calling `writeXtoInp`. By default, this function will write onto a temporary file in the R temporary directory. ```{r} # Convert X into the suitable fastphase input format, write it into a temporary file # and return the path to that file. Xinp_file = writeXtoInp(X) ``` Assuming that we have already downloaded `fastphase`, we can call it to fit the hidden Markov model to X. ```{r} fp_path = "~/bin/fastphase" # Path to the fastphase executable # Call fastphase and return the path to the parameter estimate files fp_outPath = runFastPhase(fp_path, Xinp_file, K = 12, numit = 15) ``` Above, the `SNPknock` package could not find `fastphase` because we did not provide the correct path (we cannot include third-party executable files within this package). However, if you install `fastphase` separately and provide `SNPknock` with the correct path, this will work. If the previous step worked for you, you can find the parameter estimates produced by `fastphase` in the following files: ```{r eval=FALSE} r_file = paste(fp_outPath, "_rhat.txt", sep="") alpha_file = paste(fp_outPath, "_alphahat.txt", sep="") theta_file = paste(fp_outPath, "_thetahat.txt", sep="") char_file = paste(fp_outPath, "_origchars", sep="") ``` Otherwise, for the sake of this tutorial, you can use the example parameter files provided in the package installation directory: ```{r} r_file = system.file("extdata", "genotypes_rhat.txt", package = "SNPknock") alpha_file = system.file("extdata", "genotypes_alphahat.txt", package = "SNPknock") theta_file = system.file("extdata", "genotypes_thetahat.txt", package = "SNPknock") char_file = system.file("extdata", "genotypes_origchars", package = "SNPknock") ``` Then, we can construct the hidden Markov model with: ```{r} hmm = loadHMM(r_file, alpha_file, theta_file, char_file) ``` Generating knockoff genotypes ------------------------------- Finally, we can use the hidden Markov model created above to generate knockoffs. ```{r} Xk = knockoffGenotypes(X, hmm$r, hmm$alpha, hmm$theta) table(Xk) ``` ## Knockoffs for phased haplotypes ### Fitting the hidden Markov model on haplotype data A small synthetic dataset of 1454 phased haplotype SNPs from 100 individuals can be found in the package installation directory. We can load it with: ```{r} library(SNPknock) H_file = system.file("extdata", "haplotypes.RData", package = "SNPknock") load(H_file) table(H) ``` Below, we show how to fit a hidden Markov model to this data, with the help of `fastphase`. Since `fastphase` takes as input haplotype sequences in ".inp" format, we must first convert the H matrix by calling `writeXtoInp`. By default, this function will write onto a temporary file in the R temporary directory. ```{r} # Convert X into the suitable fastphase input format, write it into a temporary file # and return the path to that file. Hinp_file = writeXtoInp(H, phased = TRUE) ``` Assuming that we have already downloaded `fastphase`, we can call it to fit the hidden Markov model to X. ```{r} fp_path = "~/bin/fastphase" # Path to the fastphase executable # Call fastphase and return the path to the parameter estimate files fp_outPath = runFastPhase(fp_path, Hinp_file, K = 12, numit = 15, phased = TRUE) ``` Above, the `SNPknock` package could not find `fastphase` because we did not provide the correct path (we cannot include third-party executable files within this package). However, if you install `fastphase` separately and provide `SNPknock` with the correct path, this will work. If the previous step worked for you, you can find the parameter estimates produced by `fastphase` in the following files: ```{r eval=FALSE} r_file = paste(fp_outPath, "_rhat.txt", sep="") alpha_file = paste(fp_outPath, "_alphahat.txt", sep="") theta_file = paste(fp_outPath, "_thetahat.txt", sep="") char_file = paste(fp_outPath, "_origchars", sep="") ``` Otherwise, for the sake of this tutorial, you can use the example parameter files provided in the package installation directory: ```{r} r_file = system.file("extdata", "haplotypes_rhat.txt", package = "SNPknock") alpha_file = system.file("extdata", "haplotypes_alphahat.txt", package = "SNPknock") theta_file = system.file("extdata", "haplotypes_thetahat.txt", package = "SNPknock") char_file = system.file("extdata", "haplotypes_origchars", package = "SNPknock") ``` Then, we can construct the hidden Markov model with: ```{r} hmm = loadHMM(r_file, alpha_file, theta_file, char_file) ``` ### Generating knockoff haplotypes Finally, we can use the hidden Markov model created above to generate knockoffs. ```{r} Hk = knockoffHaplotypes(H, hmm$r, hmm$alpha, hmm$theta) table(Hk) ``` ## See also If you want to see some basic usage of `SNPknock`, see the [introductory vignette](SNPknock.html). If you want to learn about `SNPknock` for large genome-wide association studies [@sesia2019multi], see .