% -*- mode: noweb; noweb-default-code-mode: R-mode; -*- % \VignetteIndexEntry{An R Package for Ordinal Response Prediction in High-dimensional Data Settings} % \VignetteDepends{glmpath, glmpathcr} % \VignetteKeyword{models} \documentclass[article, shortclass, nojss]{jss} \usepackage{amsmath, graphicx} \title{\pkg{glmpathcr}: An \proglang{R} Package for Ordinal Response Prediction in High-dimensional Data Settings} \author{Kellie J. Archer\\The Ohio State University} \newcommand{\bet}{\mbox{\boldmath $\beta$}} \Plainauthor{Kellie J. Archer} \Plaintitle{glmpathcr: An R Package for Ordinal Response Prediction in High-dimensional Data Settings} \Shorttitle{An \proglang{R} Package for Penalized Ordinal Response Models} \Abstract{This paper describes an \proglang{R} package, \pkg{glmpathcr}, that provides a function for fitting a penalized continuation ratio model when interest lies in predicting an ordinal response. The function, \code{glmpathcr} uses the coordinate descent fitting algorithm as implemented in \code{glmpath} and described by \citep{Parkpaper}. Methods for extracting all estimated coefficients, extracting non-zero coefficient estimates, obtaining the predicted class, and obtaining the class-specific fitted probabilities have been implemented. Additionally, generic methods from \pkg{glmpath} including \code{summary}, \code{print}, and \code{plot} can be applied to a \code{glmpathcr} object. } \Keywords{ordinal response, penalized models, LASSO, L$_1$ constraint, \proglang{R}} \Plainkeywords{ordinal response, penalized models, LASSO, L1 constraint, R} \Address{ Kellie J. Archer\\ Division of Biostatistics\\ College of Public Health\\ The Ohio State University\\ 1841 Neil Ave.\\ Columbus, OH 43210\\ E-mail: \email{archer.43@osu.edu}\\ URL: \url{https://cph.osu.edu/people/karcher} } \SweaveOpts{echo=FALSE} \usepackage{a4wide} \begin{document} %\maketitle \section{Introduction} High-throughput genomic experiments are frequently conducted for the purpose of examining whether genes are predictive of or significantly associated with phenotype. In many biomedical settings where histopathological or health status data are collected, phenotypic variables are recorded on an ordinal scale. Nevertheless, most often investigators neglect the ordinality of the phenotypic data and rather dichotomize the ordinal class than apply statistical methods suitable for two-class comparisons and predictions. This tendency to analyze ordinal data using dichotomous class methodologies may be due to the lack of available statistical methods and software for modeling an ordinal response in the presence of a high-dimensional covariate space. The approach of collapsing ordinal categories may neglect important information in the study \citep{Armstrong}. A variety of statistical modeling procedures, namely, proportional odds, adjacent category, stereotype logit, and continuation ratio models can be used to predict an ordinal response. In this paper, we focus attention to the continuation ratio model because its likelihood can be easily re-expressed such that existing software can be readily adapted and used for model fitting. Suppose for each observation, $i=1,\ldots,n$, the response $Y_i$ belongs to one ordinal class $k=1,\ldots,K$ and $\mathbf{x}_i$ represents a $p$-length vector of covariates.The backward formulation of the continuation ratio models the logit as \begin{equation} \texttt{logit}\left(P(Y=k|Y\le k, \mathbf{X}=\mathbf{x})\right)=\alpha_k+\bet_k^T\mathbf{x} \end{equation} whereas the forward formulation models the logit as \begin{equation} \texttt{logit}\left(P(Y=k|Y\ge k, \mathbf{X}=\mathbf{x})\right)=\alpha_k+\bet_k^T\mathbf{x}. \end{equation} Rather than describe both formulations in detail, here we present the backward formulation, which is commonly used when progression through disease states from none, mild, moderate, severe is represented by increasing integer values, and interest lies in estimating the odds of more severe disease compared to less severe disease \citep{Bender}. Therefore for $i=1,\ldots,n$ we can construct a vector $\mathbf{y}_i$ from $Y_i$ to represent ordinal class membership, such that $\mathbf{y}_i=(y_{i1}, y_{i2},\ldots, y_{iK})^T$, where $y_{ik}=1$ if the response is in category $k$ and 0 otherwise, so that $n_i=\sum_{k=1}^Ky_{ik}=1$. Using the logit link, the equation representing the conditional probability for class $k$ is \begin{equation} \delta_k(\mathbf{x})=P(Y = k |Y \le k, \mathbf{X}=\mathbf{x})=\frac{\exp(\alpha_k+\bet^T\mathbf{x})}{1+\exp(\alpha_k+\bet^T\mathbf{x})}. \end{equation} The likelihood for the continuation ratio model is then the product of conditionally independent binomial terms \citep{Cox}, which is given by \begin{equation} L(\bet|\mathbf{y},\mathbf{x})=\prod_{i=1}^n\delta_2^{y_{i2}}(1-\delta_2)^{1-\sum_{k=2}^Ky_{ik}}\times\cdots\times\delta_{K}^{y_{iK}}(1-\delta_{K})^{1-y_{iK}}\label{crlikelihood} \end{equation} where here we have simplified our notation by not explicitly including the dependence of the conditional probability $\delta_k$ on $\mathbf{x}$. Further, simplifying our notation to let $\bet$ represent the vector containing both the thresholds $(\alpha_2,\ldots,\alpha_{K})$ and the log odds $(\beta_1,\ldots,\beta_p)$ for all $K-1$ logits, the full parameter vector is \begin{equation} \bet=(\alpha_2,\beta_{21},\beta_{22},\ldots,\beta_{2p},\ldots,\alpha_{K},\beta_{K,1},\beta_{K,2},\ldots,\beta_{K,p})^T \end{equation} which is of length $(K-1)(p+1)$. As can be seen from equation~\ref{crlikelihood}, the likelihood can be factored into $K-1$ independent likelihoods, so that maximization of the independent likelihoods will lead to an overall maximum likelihood estimate for all terms in the model \citep{Bender}. A model consisting of $K-1$ different $\bet$ vectors may be overparameterized so to simplify, one commonly fits a constrained continuation model, which includes the $K-1$ thresholds $(\alpha_2,\ldots,\alpha_{K})$ and one common set of $p$ slope parameters, $(\beta_1,\ldots,\beta_p)$. To fit a constrained continuation ratio model, the original dataset can be restructured by forming $K-1$ subsets, where for classes $k=2,\ldots,K$, the subset contains those observations in the original dataset up to class $k$. Additionally, for the $k^{th}$ subset, the outcome is dichotomized as $y=1$ if the ordinal class is $k$ and $y=0$ otherwise. Furthermore, an indicator is constructed for each subset representing subset membership. Thereafter the $K-1$ subsets are appended to form the restructured dataset, which represents the $K-1$ conditionally independent datasets in equation~\ref{crlikelihood}. Applying a logistic regression model to this restructured dataset yields an L$_1$ constrained continuation ratio model. \section{Penalized Models} For datasets where the number of covariates $p$ exceeds the sample size $n$, the backwards stepwise procedure cannot be undertaken. Furthermore, for any problem using a forward selection procedure the discrete variable inclusion process can exhibit high variance. Moreover, for high-dimensional covariate spaces, the best subset procedure is computationally prohibitive. Two penalized methods, ridge and L$_1$ penalization, places a penalty on a function of the coefficient estimates, thereby permitting a model fit even for high-dimensional data \cite{Tibs1996, Tibs1997}. A generalization of these penalized models can be expressed as, \begin{equation} \tilde{\beta}=\arg \min_\beta(\sum_{i=1}^n(y_i-\beta_0-\sum_{j=1}^px_{ij}\beta_j)^2+\lambda\sum_{j=1}^p|\beta_j|^q) \end{equation} for $q\ge0$. When $q=1$ we have the an L$_1$ penalized model, when $q=2$ we have ridge regression. Values of $q\in (1,2)$ provide a compromise between the L$_1$ and ridge penalized models. Because when $q>1$ coefficients are no longer set exactly equal to 0, the elastic net penalty was introduced \begin{equation} \lambda\sum_{j=1}^p(\alpha\beta_j^2+(1-\alpha)|\beta_j|). \end{equation} \section{Implementation} The \pkg{glmpathcr} package was written in the \proglang{R} programming environment \citep{RTeam} and depends on the \pkg{glmpath} package \citep{Park}. Similar to the \pkg{Design} package which includes a function \code{cr.setup} for restructuring a dataset for fitting a forward continuation ratio model, in this package the model is fit by restructuring the dataset then passing the restructured dataset to a penalized logistic regression fitting function. However, unlike \code{cr.setup} which produces an object of class \code{list} from which the response and restructured independent variables are extracted and passed to a model fitting algorithm, in the \pkg{glmpathcr} package the restructuring functions are transparent to the user. Specifically, the \pkg{glmpathcr} package fits either a forward or backward (default) penalized constrained continuation ratio model by specification of \code{method="forward"} in the \code{glmpathcr} call. The \code{glmpathcr} function restructures the dataset to represent the $K-1$ conditionally independent likelihoods and then fits the penalized continuation ratio model using the \pkg{glmpath} framework. Therefore, the predictor-corrector fitting procedure used by the \code{glmpath} function in the \pkg{glmpath} package is used in fitting the penalized continuation ratio model when invoking \code{glmpathcr}. This allows fitting a penalized model for situations where the number of covariates $p$ exceed the sample size $n$. In addition, methods for extracting the best fitting model from the path using AIC and BIC criteria, obtaining predicted class and fitted class probabilities, and returning coefficient estimates were written in addition to adapting the \code{print}, \code{summary}, and \code{plot} methods from \pkg{glmpath} for a \code{glmpathcr} object. \section{Example} The \pkg{glmpathcr} package includes a filtered microarray dataset \code{diabetes} in which asymptomatic males not previously diagnosed with Type II diabetes were enrolled and subsequently were cross-classified as either normal controls (N=8), having impaired fasting glucose (N=7), or as Type II diabetics (N=9) based on a fasting glucose intolerance test. From the code below we can see that the classification variable is stored as \code{y} in the first column of the \code{diabetes} \code{data.frame}; all subsequent columns are the 11,066 Illumina probes having no negative expression values. In fitting the model we can extract the covariates into an object \code{x} and the ordinal outcome into the object \code{y}. The code for fitting a backward (default) continuation ratio model is given by <>= library(glmpathcr) data(diabetes) dim(diabetes) names(diabetes)[1:10] summary(diabetes$y) x <- diabetes[, 2:dim(diabetes)[2]] y <- diabetes$y fit <- glmpathcr(x,y) @ As with \code{glmpath} model objects, methods such as \code{summary} and \code{plot} can be applied to \code{glmpathcr} model objects, which are helpful for selecting the step at which to select the final model from the solution path. <>= summary(fit) plot(fit, xvar = "step", type = "bic") @ \begin{figure}[htbp] \begin{center} <>= plot(fit, xvar = "step", type = "bic") @ \caption{Plot of regularization path for \code{glmpathcr} object using simulated dataset, \code{data}.} \end{center} \end{figure} Note that when plotting, the horizontal axis can be \code{norm}, \code{lambda}, or \code{step}, however extractor functions for \code{glmpathcr} generally require the step to be selected, so we have selected \code{xvar = "step"} in this example. The vertical axis can be coefficients, aic or bic. As one can see, there is a multitude of models fit from one call to \code{glmpathcr}. To faciliate extraction of best fitting models using commonly used criterion, the \code{model.select} function can be used. The \code{model.select} function extracts the best fitting model from the solution path, where the \code{which} parameter allows one to select either AIC or by default, BIC. <>= BIC.step <- model.select(fit) BIC.step AIC.step <- model.select(fit, which = "AIC") AIC.step @ In this example, the minimum BIC corresponds to a 8 degree of freedom model. The \code{coef} function returns all estimated coefficients for a \code{glmpathcr} fitted model, where the model selected is indicated by step number, \code{s}. The \code{nonzero.coef} function returns only those non-zero coefficient estimates for a selected model. <>= coefficients<-coef(fit, s=BIC.step) sum(coefficients!=0) nonzero.coef(fit, s=BIC.step) @ Note that the \code{glmpathcr} function fits a penalized constrained continuation ratio model; therefore for $K$ classes, there will be $K-1$ intercepts representing the cutpoints between adjacent classes. In this package, the nomenclature for these cutpoints is to use \lq\lq cp\textit{k}\rq\rq\ where $k=1,\ldots,K-1$. In this dataset, $K=3$ so the intercepts are \code{cp1} and \code{cp2} with \code{Intercept} being an offset. The probe having the largest absolute coefficient estimate, \code{ILMN_1759232}, corresponds to the insulin receptor substrate 1 (IRS1) gene which is biologically meaningful. Continuation ratio models predicts conditional probabilities so a new method to extract the fitted probabilities and predicted class was created. The \code{predict} and \code{fitted} functions are equivalent, and return either the predicted class or the fitted probabilities from the penalized continuation ratio model for a \code{glmpathcr} object. The user is required to supply the fitted \code{glmpathcr} model object, a data matrix \code{newx} that is either the same as the training data or an independent dataset having the same number and order of covariates as the training data, a vector \code{newy} that provides the class labels of the ordinal response. These functions extract the fitted values for the best fitting model using the BIC criteria by default, which can be changed to extracting the best fitting AIC model by supplying \code{which="AIC"}. By default, the predicted class is output. If one desired the fitted class-specific probabilities from the model, the \code{type="probs"} argument should be supplied. <>= pred <- predict(fit) table(pred, y) pred <- predict(fit, type="probs") pred @ For illustrative purposes, a forward continuation ratio model can be fit using the syntax <>= fit <- glmpathcr(x, y, method="forward") @ As before, the parameter estimates corresponding to the model attaining the minimum BIC can be extracted using the following code. <>= coefficients<-coef(fit, s=BIC.step) nonzero.coef(fit, s=BIC.step) @ and the predicted class can be obtained using <>= pred <- predict(fit) table(pred, y) @ \section*{Summary} Herein we have described the \pkg{glmpathcr} package which works in conjunction with the \pkg{glmpath} package in the \proglang{R} programming environment. The package provides methods for fitting either a forward or backward penalized continuation ratio model. Moreover, the likelihood-based penalized continuation ratios models have been demonstrated to have good performance when applied to microarray gene expression datasets \citep{Archer} in comparison to corresponding penalized Bayesian continuation ratio models \citep{Kiiveri}. A similar package, \pkg{glmnetcr}, which uses the \pkg{glmnet} fitting algorithm for fitting a penalized constrained continuation ratio model has also been developed and is available for download from the Comprehensive R Archive Network. Functions for extracting coefficients, extracting non-zero coefficients, and obtaining fitted probabilities and predicted class in the \pkg{glmnetcr} package follow those in \pkg{glmpathcr} and both packages have similar performance \citep{Archer}. Therefore either the \pkg{glmpathcr} or \pkg{glmnetcr} package should be helpful when predicting an ordinal response for datasets where the number of covariates exceeds the number of available samples. \section*{Acknowledgments} This research was supported by the National Institute of Library Medicine R03LM009347 and R01LM011169. \bibliography{glmpathRefs} \end{document}