%\VignetteIndexEntry{TailRank} %\VignetteKeywords{TailRank,tail rank test} %\VignetteDepends{TailRank,methods,Biobase} %\VignettePackage{TailRank} \documentclass{article} \usepackage{hyperref} \newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{{\textit{#1}}} \title{The Tail Rank Test} \author{Kevin R. Coombes} \begin{document} \maketitle \tableofcontents \section{Introduction} OOMPA is a suite of object-oriented tools for processing and analyzing large biological data sets, such as those arising from mRNA expression microarrays or mass spectrometry proteomics. This vignette documents the tail rank test, which provides an alternative method for discovering potential biomarkers in large data sets. The idea is that one starts with a target specificity for a gene as a univariate biomarker, and uses the ``normal'' or ``baseline'' samples to estimate a threshold that yields that specificity. Then, for each gene, one counts the number of ``cancer'' or ``experimental'' samples that exceed the gene-specific threshold. Significance is determined based on control of the family-wise error rate (FWER). \section{Getting Started} As usual, we start by loading the library. <>= library(TailRank) @ The \Rpackage{TailRank} package uses an auxiliary package to supply sample data that we can use to illustrate the methods. The sample data consists of a subset containing $2000$ genes from a prostate cancer study on glass arrays reported by Lapointe and colleagues [1]. The next set of commands loads the data. <>= library(oompaData) data(expression.data) data(gene.info) data(clinical.info) dim(clinical.info) @ There are $112$ samples in the study. The \Robject{Subgroups} column of the \Robject{clinical.info} data frame refers to the subgroups discovered in the original publication by clustering based on the gene expression data. The \Robject{ChipType} column identifies the two different generations of glass arrays that were combined in the study. The \Robject{Status} column classifies the samples as normal prostate (N), primary prostate tumor (T), or lymph node metastasis (L). Since there is a natural order to this status in terms of the severity of the disease, we are going to make certain that it is used: <>= clinical.info$Status <- ordered(clinical.info$Status, levels=c("N", "T", "L")) summary(clinical.info) @ \section{Performing the Tail Rank Test} The main function in the package is the \Rfunction{TailRankTest}. We start by invoking this function with the default values of the arguments. The summary includes details on the parameters that were used, along with the fact that $49$ of the $2000$ genes were more highly expressed in non-normal samples than would be expected by chance, based on a $5\%$ FWER. <>= trt <- TailRankTest(expression.data, clinical.info$Status) #$ summary(trt) @ In the next example, we increase both the target specificity (from the default of $95\%$ to a desired value of $99\%$) and the desired confidence limits (to $99\%$ from the default of $95\%$). With this more stringent criteria, only $25$ of the genes remain significant. <>= trt2 <- TailRankTest(expression.data, clinical.info$Status, specificity=0.99, confidence=0.99) #$ summary(trt2) @ \subsection{Which genes are significant?} After performing an analysis that identifies a gene list like this, it is, of course, natural to want to know which genes were selected. The \Rfunction{as.logical} method converts the results of the tail rank test into a logical vector that selects these significant genes. Using this method, we can verify that the $25$ genes selected by the more stringent criteria are a subset of the $49$ genes selected using the weaker criteria. <>= sel <- as.logical(trt) sel2 <- as.logical(trt2) sum(sel2 & sel) @ Since this vector serves as index into the \Robject{gene.info} database, we can figure out which genes were actually selected. <>= gene.info[sel2, 3:6] @ \section{Power Computations} The power depends on the number of genes (G), the number of healthy samples (N1), the number of cancer samples (N2), the target specificity (psi), the confidence (conf = 1 - FWER), and the sensitivity that you want to be able to detect (phi). Here is an example using the sizes from the prostate cancer data set, showing that we have more than $70\%$ power to detect a marker with $40\%$ sensitivity. <>= tailRankPower(2000, N1=41, N2=71, psi=0.95, phi=0.40, conf=0.95) @ The next example shows that the power decreases to $43\%$ when using the same number of samples with a whole genome array containing $40000$ gene probes. (This was the size of the full study from which these $2000$ genes were randomly selected.) <>= tailRankPower(40000, N1=41, N2=71, psi=0.95, phi=0.40, conf=0.95) @ We can determine the power for a variety of cancer sample sizes, keeping everything else the same <>= tailRankPower(40000, N1=41, N2=seq(40,100,by=10), psi=0.95, phi=0.40, conf=0.95) @ More generally, we can create power tables using the \Rfunction{biomarkerPowerTable} function. Individual tables have rows labeled by the number of ``cancer'' samples and columns labeled by the desired sensitivity; the entries in the table show the power to detect that level of sensitivity when using that many samples. <>= biomarkerPowerTable(G=c(10000, 20000, 40000), N1=41, N2=seq(40, 100, by=10), conf=0.95, psi=0.95, phi=seq(0.30, 0.50, by=0.05)) @ \section{References} [1] Lapointe J et al. (2004) Gene expression profiling identifies clinically relevant subtypes of prostate cancer. \emph{Proc Natl Acad Sci U S A}, 101, 811--816. \end{document}