The DYNATE
package accompanies the paper “Localizing Rare-Variant Association Regions via Multiple Testing Embedded in an Aggregation Tree”. The package is designed to pinpoint the disease-associated rare variant sets with a controlled false discovery rate in case-control study.
library(DYNATE)
Here, we show how to apply DYNATE
.
We require the input data is a data frame with a long format: each row is a rare variant (SNP) per sample. Specifically, the input data should contains 6 variables with name Sample.Name
, Sample.Type
, snpID
, domainID
, X1
, X2
. Where variables Sample.Name
, snpID
and domainID
indicate the Sample ID, SNP ID, and domain ID, respectively; Variable Sample.Type
indicates the case/control status of each sample; Variables X1
and X2
are covariates that could be considered in the analysis. The snp_dat
below is a toy simulated data with 6 variables and 210,454 rows. The data contains 2,000 samples (1,000 cases and 1,000 controls). In total 16,281 SNPs reside in 2,000 domains are considered in snp_dat
.
str(snp_dat)
#> 'data.frame': 54282 obs. of 6 variables:
#> $ snpID : num 1 1 1 1 1 1 2 2 2 2 ...
#> $ Sample.Type: chr "case" "case" "ctrl" "ctrl" ...
#> $ Sample.Name: int 58 695 1373 1504 1898 1938 110 199 292 326 ...
#> $ domainID : int 1 1 1 1 1 1 1 1 1 1 ...
#> $ X1 : num 0.38242 -0.46531 -0.23738 -0.0889 0.00982 ...
#> $ X2 : int 1 1 1 0 1 1 1 0 1 0 ...
First, we set the tuning parameters as follows. Please refer to the paper for detailed tuning parameters selection procedure.
<- 5 # leaf size
M <- 3 # layer number
L <- 0.05 # desired FDR alpha
Second, we use Test_Leaf
function to construct leaves and generate leaf P-values for the case-control study.
# Model consider covariates effect:
<- Sys.time()
t1 <- Test_Leaf(snp_dat=snp_dat,thresh_val=M)
p_leaf <- Sys.time()
t2 -t1
t2#> Time difference of 1.350046 secs
In the output data frames p_leaf
, each row links to a rare variant (SNPs), and the number of rows equals the number of rare variants (SNPs) we considered (SNPs that link to a leaf with p-value=1 are excluded for maintaining the algorithm stability). The data frame includes 5 variables. In the data frame, variable L1
is leaf ID; variable pvals
is the leaf level p values; variable Test
indicates the name of the statistical test to generate the leaf level p values (FET or score).
str(p_leaf)
#> Classes 'data.table' and 'data.frame': 4214 obs. of 5 variables:
#> $ L1 : int 1 2 3 3 4 5 5 6 7 8 ...
#> $ pvals : num 0.453 0.887 0.548 0.548 0.895 ...
#> $ snpID : num 1 2 3 4 5 6 7 8 9 10 ...
#> $ domainID: int 1 1 1 1 1 1 1 1 1 1 ...
#> $ Test : chr "FET" "FET" "FET" "FET" ...
#> - attr(*, ".internal.selfref")=<externalptr>
Finally, we use the function DYNATE
to conduct dynamic and hierarchical testing based on the leaf level p values.
<- DYNATE(struct_map=p_leaf,L=L,alpha=alpha) out
In the output data frames out
, each row links to a unique SNP that is detected by DYNATE. The variables snpID
, L1
, and domainID
link to the detected SNP ID, leaf ID, and domain ID, respectively; Variable Test
links to the name of the statistical test we applied (FET or score); Variable pvals1
links to the leaf level p-values; Variable Layer
indicates in which layer the SNP is detected.
str(out)
#> tibble [9 × 6] (S3: tbl_df/tbl/data.frame)
#> $ L1 : int [1:9] 205 1357 1357 1462 2085 2479 3178 546 547
#> $ snpID : num [1:9] 277 1811 1812 1957 2785 ...
#> $ domainID: int [1:9] 20 168 168 181 312 376 493 66 66
#> $ Test : chr [1:9] "FET" "FET" "FET" "FET" ...
#> $ pvals1 : num [1:9] 1.25e-08 4.76e-34 4.76e-34 6.65e-05 2.09e-13 ...
#> $ Layer : int [1:9] 1 1 1 1 1 1 1 2 2