---
title: "Using SNPknock with Genetic Data"
author: "Matteo Sesia (msesia@stanford.edu)"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
bibliography: ../inst/REFERENCES.bib
vignette: >
%\VignetteIndexEntry{Using SNPknock with Genetic Data}
%\VignetteEngine{knitr::rmarkdown}
%\usepackage[utf8]{inputenc}
---
This vignette illustrates the basic usage of the `SNPknock` package in combination with the phasing software `fastphase` [@scheet2006] to create knockoffs of unphased genotypes or phased haplotypes [@sesia2019, @sesia2019multi]. Since `fastphase` is not available as an R package, this functionality of `SNPknock` requires the user to first obtain a copy of `fastphase`.
To learn more about the application of `SNPknock` to large genome-wide association studies [@sesia2019multi], visit: .
## Obtaining `fastphase`
`fastphase` is a phasing and imputation tool based on the hidden Markov model described in [@scheet2006].
Binary executables for Linux and Mac OS are available from .
Before continuing with this tutorial, download the `fastphase` tarball from the above link and extract the `fastphase` executable file into a convenient directory (e.g. "~/bin/").
## Knockoffs for unphased genotypes
### Fitting the hidden Markov model on genotype data
A small synthetic dataset of 1454 unphased genotype SNPs from 100 individuals can be found in the package installation directory.
We can load it with:
```{r}
library(SNPknock)
X_file = system.file("extdata", "genotypes.RData", package = "SNPknock")
load(X_file)
table(X)
```
Below, we show how to fit a hidden Markov model to this data, with the help of `fastphase`.
Since `fastphase` takes as input genotype sequences in ".inp" format, we must first convert
the X matrix by calling `writeXtoInp`.
By default, this function will write onto a temporary file in the R temporary directory.
```{r}
# Convert X into the suitable fastphase input format, write it into a temporary file
# and return the path to that file.
Xinp_file = writeXtoInp(X)
```
Assuming that we have already downloaded `fastphase`, we can call it to fit the hidden Markov model to X.
```{r}
fp_path = "~/bin/fastphase" # Path to the fastphase executable
# Call fastphase and return the path to the parameter estimate files
fp_outPath = runFastPhase(fp_path, Xinp_file, K = 12, numit = 15)
```
Above, the `SNPknock` package could not find `fastphase` because we did not provide the correct path
(we cannot include third-party executable files within this package). However, if you install `fastphase`
separately and provide `SNPknock` with the correct path, this will work.
If the previous step worked for you, you can find the parameter estimates produced by `fastphase`
in the following files:
```{r eval=FALSE}
r_file = paste(fp_outPath, "_rhat.txt", sep="")
alpha_file = paste(fp_outPath, "_alphahat.txt", sep="")
theta_file = paste(fp_outPath, "_thetahat.txt", sep="")
char_file = paste(fp_outPath, "_origchars", sep="")
```
Otherwise, for the sake of this tutorial, you can use the example parameter files provided
in the package installation directory:
```{r}
r_file = system.file("extdata", "genotypes_rhat.txt", package = "SNPknock")
alpha_file = system.file("extdata", "genotypes_alphahat.txt", package = "SNPknock")
theta_file = system.file("extdata", "genotypes_thetahat.txt", package = "SNPknock")
char_file = system.file("extdata", "genotypes_origchars", package = "SNPknock")
```
Then, we can construct the hidden Markov model with:
```{r}
hmm = loadHMM(r_file, alpha_file, theta_file, char_file)
```
Generating knockoff genotypes
-------------------------------
Finally, we can use the hidden Markov model created above to generate knockoffs.
```{r}
Xk = knockoffGenotypes(X, hmm$r, hmm$alpha, hmm$theta)
table(Xk)
```
## Knockoffs for phased haplotypes
### Fitting the hidden Markov model on haplotype data
A small synthetic dataset of 1454 phased haplotype SNPs from 100 individuals can be found in the package installation directory.
We can load it with:
```{r}
library(SNPknock)
H_file = system.file("extdata", "haplotypes.RData", package = "SNPknock")
load(H_file)
table(H)
```
Below, we show how to fit a hidden Markov model to this data, with the help of `fastphase`.
Since `fastphase` takes as input haplotype sequences in ".inp" format, we must first convert
the H matrix by calling `writeXtoInp`.
By default, this function will write onto a temporary file in the R temporary directory.
```{r}
# Convert X into the suitable fastphase input format, write it into a temporary file
# and return the path to that file.
Hinp_file = writeXtoInp(H, phased = TRUE)
```
Assuming that we have already downloaded `fastphase`, we can call it to fit the hidden Markov model to X.
```{r}
fp_path = "~/bin/fastphase" # Path to the fastphase executable
# Call fastphase and return the path to the parameter estimate files
fp_outPath = runFastPhase(fp_path, Hinp_file, K = 12, numit = 15, phased = TRUE)
```
Above, the `SNPknock` package could not find `fastphase` because we did not provide the correct path
(we cannot include third-party executable files within this package). However, if you install `fastphase`
separately and provide `SNPknock` with the correct path, this will work.
If the previous step worked for you, you can find the parameter estimates produced by `fastphase`
in the following files:
```{r eval=FALSE}
r_file = paste(fp_outPath, "_rhat.txt", sep="")
alpha_file = paste(fp_outPath, "_alphahat.txt", sep="")
theta_file = paste(fp_outPath, "_thetahat.txt", sep="")
char_file = paste(fp_outPath, "_origchars", sep="")
```
Otherwise, for the sake of this tutorial, you can use the example parameter files provided
in the package installation directory:
```{r}
r_file = system.file("extdata", "haplotypes_rhat.txt", package = "SNPknock")
alpha_file = system.file("extdata", "haplotypes_alphahat.txt", package = "SNPknock")
theta_file = system.file("extdata", "haplotypes_thetahat.txt", package = "SNPknock")
char_file = system.file("extdata", "haplotypes_origchars", package = "SNPknock")
```
Then, we can construct the hidden Markov model with:
```{r}
hmm = loadHMM(r_file, alpha_file, theta_file, char_file)
```
### Generating knockoff haplotypes
Finally, we can use the hidden Markov model created above to generate knockoffs.
```{r}
Hk = knockoffHaplotypes(H, hmm$r, hmm$alpha, hmm$theta)
table(Hk)
```
## See also
If you want to see some basic usage of `SNPknock`, see the [introductory vignette](SNPknock.html).
If you want to learn about `SNPknock` for large genome-wide association studies [@sesia2019multi], see .