DYNATE

The DYNATE package accompanies the paper “Localizing Rare-Variant Association Regions via Multiple Testing Embedded in an Aggregation Tree”. The package is designed to pinpoint the disease-associated rare variant sets with a controlled false discovery rate in case-control study.

library(DYNATE)

We require the input data is a data frame with a long format: each row is a rare variant (SNP) per sample. Specifically, the input data should contains 6 variables with name Sample.Name, Sample.Type, snpID, domainID, X1, X2. Where variables Sample.Name, snpID and domainID indicate the Sample ID, SNP ID, and domain ID, respectively; Variable Sample.Type indicates the case/control status of each sample; Variables X1 and X2 are covariates that could be considered in the analysis. The snp_dat below is a toy simulated data with 6 variables and 210,454 rows. The data contains 2,000 samples (1,000 cases and 1,000 controls). In total 16,281 SNPs reside in 2,000 domains are considered in snp_dat.

str(snp_dat)
#> 'data.frame':    54282 obs. of  6 variables:
#>  $ snpID      : num  1 1 1 1 1 1 2 2 2 2 ...
#>  $ Sample.Type: chr  "case" "case" "ctrl" "ctrl" ...
#>  $ Sample.Name: int  58 695 1373 1504 1898 1938 110 199 292 326 ...
#>  $ domainID   : int  1 1 1 1 1 1 1 1 1 1 ...
#>  $ X1         : num  0.38242 -0.46531 -0.23738 -0.0889 0.00982 ...
#>  $ X2         : int  1 1 1 0 1 1 1 0 1 0 ...

First, we set the tuning parameters as follows. Please refer to the paper for detailed tuning parameters selection procedure.

M <- 5 # leaf size
L <- 3 # layer number
alpha <- 0.05 # desired FDR

Second, we use Test_Leaf function to construct leaves and generate leaf P-values for the case-control study.

# Model consider covariates effect:
t1 <- Sys.time()
p_leaf <- Test_Leaf(snp_dat=snp_dat,thresh_val=M)
t2 <- Sys.time()
t2-t1
#> Time difference of 1.350046 secs

In the output data frames p_leaf, each row links to a rare variant (SNPs), and the number of rows equals the number of rare variants (SNPs) we considered (SNPs that link to a leaf with p-value=1 are excluded for maintaining the algorithm stability). The data frame includes 5 variables. In the data frame, variable L1 is leaf ID; variable pvals is the leaf level p values; variable Test indicates the name of the statistical test to generate the leaf level p values (FET or score).

str(p_leaf)
#> Classes 'data.table' and 'data.frame':   4214 obs. of  5 variables:
#>  $ L1      : int  1 2 3 3 4 5 5 6 7 8 ...
#>  $ pvals   : num  0.453 0.887 0.548 0.548 0.895 ...
#>  $ snpID   : num  1 2 3 4 5 6 7 8 9 10 ...
#>  $ domainID: int  1 1 1 1 1 1 1 1 1 1 ...
#>  $ Test    : chr  "FET" "FET" "FET" "FET" ...
#>  - attr(*, ".internal.selfref")=<externalptr>

Finally, we use the function DYNATE to conduct dynamic and hierarchical testing based on the leaf level p values.

out <- DYNATE(struct_map=p_leaf,L=L,alpha=alpha)

In the output data frames out, each row links to a unique SNP that is detected by DYNATE. The variables snpID, L1, and domainID link to the detected SNP ID, leaf ID, and domain ID, respectively; Variable Test links to the name of the statistical test we applied (FET or score); Variable pvals1 links to the leaf level p-values; Variable Layer indicates in which layer the SNP is detected.

str(out)
#> tibble [9 × 6] (S3: tbl_df/tbl/data.frame)
#>  $ L1      : int [1:9] 205 1357 1357 1462 2085 2479 3178 546 547
#>  $ snpID   : num [1:9] 277 1811 1812 1957 2785 ...
#>  $ domainID: int [1:9] 20 168 168 181 312 376 493 66 66
#>  $ Test    : chr [1:9] "FET" "FET" "FET" "FET" ...
#>  $ pvals1  : num [1:9] 1.25e-08 4.76e-34 4.76e-34 6.65e-05 2.09e-13 ...
#>  $ Layer   : int [1:9] 1 1 1 1 1 1 1 2 2