--- title: "Using SNPknock in R" author: "Matteo Sesia (msesia@stanford.edu)" date: "`r Sys.Date()`" output: rmarkdown::html_vignette bibliography: ../inst/REFERENCES.bib vignette: > %\VignetteIndexEntry{Using SNPknock in R} %\VignetteEngine{knitr::rmarkdown} %\usepackage[utf8]{inputenc} --- This vignette illustrates the usage of the `SNPknock` package for creating knockoffs of discrete Markov chains and hidden Markov models [@sesia2019]. For simplicity, we will use synthetic data. The `SNPknock` package also provides a simple interface to the genotype imputation software `fastPhase`, which can be used to fit hidden Markov models for genotype data. Since `fastPhase` is not available as an R package, this particular functionality of `SNPknock` cannot be demonstrated here. A tutorial showing how to use a combination of `SNPknock` and `fastPhase` to create knockoffs of genotype data can be found [here](genotypes.html). ## Knockoffs of discrete Markov chains First, we verify that the `SNPknock` can be loaded. ```{r} library(SNPknock) ``` We define a Markov chain model with 50 variables, each taking one of 5 possible values. We specify a uniform marginal distribution for the first variable in the chain and create 49 transition matrices with randomly sampled entries. ```{r} p=50; # Number of variables in the model K=5; # Number of possible states for each variable # Marginal distribution for the first variable pInit = rep(1/K,K) # Create p-1 transition matrices Q = array(stats::runif((p-1)*K*K),c(p-1,K,K)) for(j in 1:(p-1)) { Q[j,,] = Q[j,,] + diag(rep(1,K)) Q[j,,] = Q[j,,] / rowSums(Q[j,,]) } ``` We can sample 100 independent observations of this Markov chain using the `SNPknock` package. ```{r} set.seed(1234) X = sampleDMC(pInit, Q, n=100) print(X[1:5,1:10]) ``` Above, each row of `X` contains an independent realization of the Markov chain. A knockoff copy of `X` can be sampled as follows. ```{r} Xk = knockoffDMC(X, pInit, Q) print(Xk[1:5,1:10]) ``` ## Knockoffs of hidden Markov models We define a hidden Markov chain model with 50 variables, each taking one of 3 possible values, and 50 latent variables distributed as a Markov chain, each taking one of 5 values. We specify a uniform marginal distribution for the first variable in the chain and create 49 transition matrices with randomly sampled entries. We also define an emission distributions over the 3 possible emission states, for each of the 50 latent variables and the 5 latent states. ```{r} p=200; # Number of variables in the model K=5; # Number of possible states for each variable M=3; # Number of possible emission states for each variable # Marginal distribution for the first variable pInit = rep(1/K,K) # Create p-1 transition matrices Q = array(stats::runif((p-1)*K*K),c(p-1,K,K)) rho = stats::runif(p-1, min = 1, max = 50) for(j in 1:(p-1)) { Q[j,,] = Q[j,,] + rho[j] * diag(rep(1,K)) Q[j,,] = Q[j,,] / rowSums(Q[j,,]) } # Create p emission matrices pEmit = array(stats::runif(p*M*K),c(p,M,K)) for(j in 1:p) { pEmit[j,,] = sweep(pEmit[j,,],2,colSums(pEmit[j,,]),`/`) } ``` We can sample 100 independent observations of this hidden Markov model using the `SNPknock` package. ```{r} set.seed(1234) X = sampleHMM(pInit, Q, pEmit, n=100) print(X[1:5,1:10]) ``` Above, each row of `X` contains an independent realization of the hidden Markov model. A knockoff copy of `X` can be sampled as follows. ```{r} Xk = knockoffHMM(X, pInit, Q, pEmit) print(Xk[1:5,1:10]) ``` ## See also If you want to see how to use `SNPknock` to create knockoffs of genotype data, see the [genotypes vignette](genotypes.html). If you want to learn about `SNPknock` for large genome-wide association studies [@sesia2019multi], see