penalizedclr: an R package for penalized conditional logistic regression

Introduction

The R package penalizedclr provides an implementation of the penalized logistic regression model that can be used in the analysis of matched case-control studies. The implementation allows to apply different penalties to different blocks of covariates, and is therefore particularly useful in the presence of multi-source heterogenous data, such as multiple layers of omics measurements. Both L1 and L2 penalties are implemented. Additionally, the package implements stability selection to allow for variable selection in the considered regression model.

Installation

You can install the released version of penalizedclr from CRAN with:

install.packages("penalizedclr")

And the development version from GitHub with:

library(devtools)
install_github("veradjordjilovic/penalizedclr")

library(penalizedclr)

The setting

Assume that we have $$K$$ independent matched case-control pairs $$(Y_{ki}, X_{ki})$$, where $$Y_{ki}$$, $$k=1,\ldots,K;$$ $$i=1,2,$$ is a binary variable indicating case control status (1 if case, 0 if control) and $$X_{ki}$$ is the associated $$p$$-dimensional vector of covariates. The conditional logistic regression models the probability of being a case given that the observation belongs to the $$k$$-th pair as: ${\mathrm {logit}}\left[P(Y_{ki}=1 \mid S=k)\right] = \beta_{0k} + \beta^{T}X_{ki}, \quad k\in\left\{1,\ldots, K\right\}, i\in\left\{1,2\right\}$ where $$S$$ is the matched pair id, $$\beta_{0k}$$ is the pair specific intercept and $$\beta=(\beta_1, \ldots,\beta_p)^T$$ is a $$p$$-dimensional vector of fixed effects.

In the present package we:

• estimate $$\beta$$ in the high dimensional setting in which the number of covariates $$p$$ is much higher than the number of pairs $$K$$. We consider a penalized conditional logistic regression, which adds a penalty to the conditional log likelihood. Motivated by current medical applications considering clinical and molecular data, we allow $$X_{ki}$$ to be a merge of heteregenous data sources;

• perform stability selection to identify important variables, that is, variables for which $$\beta_j\neq 0$$, $$j\in\left\{1,\ldots,p\right\}$$.

Examples

In this section we provide examples of how to fit a penalized conditional regression model with source-specific penalty parameters and how to perform variable selection with penalizedclr.

Initial settings and libraries to be loaded:

set.seed(1234)
library(tidyverse)

Data generation

We generate a simple multi-source data set, with two groups of covariates containing 12 and 40 variables, respectively. As specified above, each case is matched to a control, and the probability of being a case in each stratum (case-control pair) is obtained from the linear predictor. An intercept is generated independently for each stratum.

# two groups of predictors
p <- c(12, 40)

# percentage and number of non-null variables
p_nz <- c(0.5, 0.2)
m_nz <- round(p*p_nz, 0)

# number of different strata (case-control pairs)
K <- 100

# number of cases and controls in each stratum (not necessarily 1:1 matching,
# other designs are also allowed)
n_cases <- 1
n_ctrl <- 1

# generating covariates
X = cbind(matrix(rnorm(p[1] * K * (n_cases + n_ctrl), 0, 1), ncol = p[1]),
matrix(rnorm(p[2] * K * (n_cases + n_ctrl), 0, 4), ncol = p[2]))

# coefficients
beta <- as.matrix(c(rnorm(m_nz[1], 4, 1),
rep(0, p[1] - m_nz[1]),
rnorm(m_nz[2], 2, 0.8),
rep(0, p[2] - m_nz[2])), ncol = 1)

# stratum membership
stratum <- rep(1:K, each= n_cases+n_ctrl)

# linear predictor
lin_pred <-  X %*% beta

prob_case <- exp(lin_pred) / (1 + exp(lin_pred))

# generate the response

Y <- rep(0, length(stratum))

data_sim <- as_tibble(data.frame(stratum = stratum,
probability = prob_case,
obs_id = 1 : length(stratum)))
data_sim_cases <- data_sim %>%
group_by(stratum)%>%
sample_n(n_cases, weight = probability)

Y[data_sim_cases$obs_id] <- 1 Penalized conditional logistic regression: model fit The function penalized.clr fits a penalized conditional logistic regression model with different penalties for different blocks of covariates. The L1 penalty parameter lambda (a numeric vector of the length equal to the number of blocks) is specified by the user. This code illustrates how to fit penalized.clr with penalty parameters specified by the user. Here Y is the response vector, X is the multi-source matrix of covariates, stratum is the vector of ids of the matching pairs, and p is the vector of block dimensions. It has the same dimension as the vector of penalty parameters lambda. Penalty parameters are thus specified for each data source, i.e., a block of covariates. It is possible to standardize the variables by setting standardize = TRUE (TRUE by default). fit1 <- penalized.clr(response = Y, penalized = X, stratum = stratum, lambda = c(1,4), p = p, standardize = TRUE) The function returns a list with elements: 1. penalized - Regression coefficients for the penalized covariates. 2. unpenalized - Regression coefficients for the unpenalized covariates. 3. converged - Whether the fitting process was judged to be converged. 4. lambda - The tuning parameter for L1 used. 5. alpha - The elastic net mixing parameter used. 6. Y - The response vector saved as a survival object used internally. For instance, fit1$penalized is a numeric vector containing the regression coefficients for the penalized covariates. In this simple example, we can compare estimated coefficients with the true coefficients (the ones that generated our data), by constructing a 2 x 2 contingency table cross-tabulating true non-zero and estimated non-zero coefficients:

str(fit1)
#> List of 6
#>  $penalized : num [1:52] 0.251 0.212 0.584 0.402 0.613 ... #>$ unpenalized: num(0)
#>  $converged : logi TRUE #>$ lambda     : num [1:2] 1 4
#>  $alpha : num 1 #>$ Y          : 'Surv' num [1:200, 1:2] 1+ 1  1+ 1  1  1+ 1+ 1  1  1+ ...
#>   ..- attr(*, "dimnames")=List of 2
#>   .. ..$: NULL #> .. ..$ : chr [1:2] "time" "status"
#>   ..- attr(*, "type")= chr "right"
nonzero_index <- (beta != 0) * 1 #index of nonzero coefficients
table(fitted = (fit1$penalized != 0) * 1, nonzero_index) #> nonzero_index #> fitted 0 1 #> 0 19 2 #> 1 19 12 The package penalizedclr is not limited to L1, i.e. lasso, penalty. While L1 penalty is more suited for variable selection, in the presence of highly correlated covariates, it can be useful to add small amount of L2 penalty. The two can be combined in an elastic net framework by specifying the mixing parameter alpha that can assume values between 0 and 1. Default alpha=1 gives lasso. fit2 <- penalized.clr(response = Y, penalized = X, stratum = stratum, lambda = c(1,4),p = p, standardize = TRUE, alpha = 0.6) Penalized conditional logistic regression: stability selection The function stable.clr performs stability selection (Meinshausen and Bühlmann 2010 and Shah and Samworth 2013) for variable selection in penalized conditional logistic regression. For details on stability selection, we refer to the original publications. Briefly, a set of L1 penalties is considered, and for each considered value of the penalty, $$2B$$ random subsamples of $$\lfloor K/2 \rfloor$$ matched pairs are taken and a penalized model is estimated. For each covariate and penalty value, selection probability is estimated as the proportion of estimated models in which the associated coefficient estimate is different from zero. Finally, the overall selection probability of a variable is obtained by taking the maximum selection probability over all considered penalty values. Parameter $$B$$ is set to 100 by default, but can be changed by the user. Note that this choice will have an impact on the computation time, and higher values of $$B$$ will lead to a slower computation. The function returns a list containing the selection probabilities for each covariate, i.e. the proportion of estimated models in which the associated coefficient estimate is different from zero. The user can then set a threshold for selection probability (values ranging from 0.55 to 0.9 are recommended) and obtain the set of selected covariates. Case 1: Penalty parameters specified by the user The following code performs stability selection when all covariates are considered as a single block (a single data source) and a sequence of L1 penalties contains 3 values. stable1 <- stable.clr(response = Y, B = 50, penalized = X, stratum = stratum, lambda.seq = c(1, 4, 6), parallel = TRUE) The function returns a list with two elements: 1. Pistab a numeric vector giving estimates of selection probabilities for each penalized covariate. 2. lambda.seq a sequence of L1 penalties used. To inspect the results, we can, for instance, print the covariates with selection probability higher than 0.6: 2, 3, 4, 5, 13, 16, 17, 18, 39, 47. We can inspect the histogram of selection probabilities. hist(stable1$Pistab, xlab = "Selection probability", ylab="Frequency")

and compute variable selection classification accuracy:

(tab1 <- table(stable = (stable1$P>0.65) * 1, nonzero_index)) #> nonzero_index #> stable 0 1 #> 0 37 7 #> 1 1 7 (tab1[1,1] + tab1[2,2])/(sum(tab1)) #> [1] 0.8461538 This histrogram shows null variables in blue and signal variables in red. # plotting selection probabilities for true non-zero (red) and zero (blue) coefficients s_prob_nonzero <- cut(stable1$P[nonzero_index == 1],
breaks = seq(0,1, by = 0.1), ordered_result = T)
s_prob_zero <- cut(stable1$P[nonzero_index == 0], breaks = seq(0,1, by = 0.1), ordered_result = T) barplot(table(s_prob_nonzero), col = 2) barplot(table(s_prob_zero), add =T, col = 4 ) Case 2: Multiple data sources This code implements stability selection while taking into account the block structure of covariates ($$p$$). To achieve this, we use the function stable.clr.g which extends stable.clr to allow for blocks (groups) of covariates. In this case, the function takes as an argument lambda.list, a list of numeric vectors of L1 penalties. Each vector has the length equal to the number of blocks, with the $$i$$th element giving the penalty to be applied to the $$i$$th data source. stable2 <- penalizedclr::stable.clr.g(response = Y, p = p, standardize = TRUE, penalized = X, stratum = stratum, lambda.list = list(c(4,1), c(1,4))) The selection probability histogram in this case. Case 3: Data adaptive penalty parameters It is possible to obtain a data adaptive sequence of L1 penalty parameters via the functions default.pf, that searches for the suitable vector of penalty factors (up to a multiplicative constant), and find.default.lambda that searches for the optimal L1 penalty for a given vector of penalty factors. The first function, follows the procedure proposed by Gerhard Schulze in his Master Thesis (Clinical Outcome Prediction based on Multi-Omics Data: Extension of IPF-LASSO). In short, in the first step, a tentative conditional logistic regression model is fit, either to each covariates block separately (type.step1 = "sep") or to all blocks jointly (type.step1 = "comb"). The penalty factor for each block is then obtained as the inverse of the mean of the absolute values of the estimated coefficients of that block. In this way, blocks that contain covariates with large estimated coefficients will be penalized less. If all estimated coefficients pertaining to a block are zero, the function returns a message. The function returns a list with penalty factors corresponding to different blocks. The second function relies on the cv.clogitL1 function of the clogitL1 package to perform cross-validation and, for a given vector of penalty factors, for instance obtained from default.pf returns the value of lambda that achieves the minimal cross-validated deviance, see documentation of package clogitL1 for more details.  (pf <- default.pf(response = Y, stratum = stratum, penalized = X, nfolds = 10, alpha = 0.3, standardize = TRUE, p = p, type.step1 = "comb")) #>$pf
#> [1] 32.70923 42.73989

(lambda <- find.default.lambda(response = Y, penalized = X,
standardize = TRUE, stratum = stratum,
p = p,  pf.list  = pf))
#> [[1]]
#> [1] 0.1375511
stable3 <- stable.clr.g(response = Y, p = p,
standardize = TRUE,
penalized = X, stratum = stratum,
lambda.list = list(c(pf[[1]][1]*as.numeric(lambda), pf[[1]][2]*as.numeric(lambda))))

The histogram of selection probabilities with true signal variables in red and null variables in blue.

The user can augment the data adaptive vector of penalty factors with other vectors.

alt.pf.list<- list(c(1,4), c(2,4), c(4,1), c(4,2), pf\$pf)
alt.lambda <- find.default.lambda(response = Y, penalized = X,
standardize = TRUE, stratum = stratum,
p = p,  pf.list  = alt.pf.list)
lambda.matrix <- mapply(function (x,y) x*y, lambda, alt.pf.list)

lambda.list <- lapply(seq_len(ncol(lambda.matrix)), function(i) lambda.matrix[,i])
stable4 <- stable.clr.g(response = Y, p = p,
standardize = TRUE,
penalized = X, stratum = stratum,
lambda.list = lambda.list )

The histogram of selection probabilities:

References

1. Boulesteix, A. L., De Bin, R., Jiang, X., & Fuchs, M. (2017). IPF-LASSO: Integrative-penalized regression with penalty factors for prediction based on multi-omics data. Computational and mathematical methods in medicine, 2017.

2. Meinshausen, N., & Bühlmann, P. (2010). Stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(4), 417-473.

3. Shah, R. D., & Samworth, R. J. (2013). Variable selection with error control: another look at stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 75(1), 55-80.