The R package `penalizedclr` provides an implementation of the
penalized conditional logistic regression model for the analysis of
matched case-control studies. The implementation makes it possible to apply
different penalties to different blocks of covariates, and is therefore
particularly useful in the presence of multi-source heterogeneous data,
such as multiple layers of omics measurements. Both L1 and L2 penalties
are implemented. Additionally, the package implements stability
selection to allow for variable selection in the considered regression
model.

You can install the released version of `penalizedclr` from CRAN with:
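```
install.packages("penalizedclr")
```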

And the development version from GitHub with:
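```
# if needed: install.packages("devtools")
# the repository path below is assumed, not confirmed by this document
devtools::install_github("veradjordjilovic/penalizedclr")
```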

Load the package with:
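```
library(penalizedclr)
```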

Assume that we have \(K\) independent matched case-control pairs \((Y_{ki}, X_{ki})\), where \(Y_{ki}\), \(k=1,\ldots,K;\) \(i=1,2,\) is a binary variable indicating case control status (1 if case, 0 if control) and \(X_{ki}\) is the associated \(p\)-dimensional vector of covariates. The conditional logistic regression models the probability of being a case given that the observation belongs to the \(k\)-th pair as: \[ {\mathrm {logit}}\left[P(Y_{ki}=1 \mid S=k)\right] = \beta_{0k} + \beta^{T}X_{ki}, \quad k\in\left\{1,\ldots, K\right\}, i\in\left\{1,2\right\} \] where \(S\) is the matched pair id, \(\beta_{0k}\) is the pair specific intercept and \(\beta=(\beta_1, \ldots,\beta_p)^T\) is a \(p\)-dimensional vector of fixed effects.

In the present package we:

- estimate \(\beta\) in the high-dimensional setting in which the number of covariates \(p\) is much higher than the number of pairs \(K\). We consider a penalized conditional logistic regression, which adds a penalty to the conditional log-likelihood. Motivated by current medical applications considering clinical and molecular data, we allow \(X_{ki}\) to be a combination of heterogeneous data sources;

- perform stability selection to identify important variables, that is, variables for which \(\beta_j\neq 0\), \(j\in\left\{1,\ldots,p\right\}\).

In this section we provide examples of how to fit a penalized
conditional regression model with source-specific penalty parameters and
how to perform variable selection with `penalizedclr`.

Initial settings and libraries to be loaded:
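```
# the simulation below uses dplyr (as_tibble(), group_by(), sample_n())
library(dplyr)
# arbitrary seed, for reproducibility of the simulated data
set.seed(1234)
```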

We generate a simple multi-source data set, with two groups of covariates containing 12 and 40 variables, respectively. As specified above, each case is matched to a control, and the probability of being a case in each stratum (case-control pair) is obtained from the linear predictor. An intercept is generated independently for each stratum.

```
# two groups of predictors
p <- c(12, 40)

# percentage and number of non-null variables
p_nz <- c(0.5, 0.2)
m_nz <- round(p * p_nz, 0)

# number of different strata (case-control pairs)
K <- 100

# number of cases and controls in each stratum (not necessarily 1:1 matching,
# other designs are also allowed)
n_cases <- 1
n_ctrl <- 1

# generating covariates
X <- cbind(matrix(rnorm(p[1] * K * (n_cases + n_ctrl), 0, 1), ncol = p[1]),
           matrix(rnorm(p[2] * K * (n_cases + n_ctrl), 0, 4), ncol = p[2]))

# coefficients
beta <- matrix(c(rnorm(m_nz[1], 4, 1),
                 rep(0, p[1] - m_nz[1]),
                 rnorm(m_nz[2], 2, 0.8),
                 rep(0, p[2] - m_nz[2])), ncol = 1)

# stratum membership
stratum <- rep(1:K, each = n_cases + n_ctrl)

# linear predictor
lin_pred <- X %*% beta
prob_case <- exp(lin_pred) / (1 + exp(lin_pred))

# generate the response: in each stratum, sample cases with
# probability proportional to prob_case
Y <- rep(0, length(stratum))
data_sim <- as_tibble(data.frame(stratum = stratum,
                                 probability = prob_case,
                                 obs_id = 1:length(stratum)))
data_sim_cases <- data_sim %>%
  group_by(stratum) %>%
  sample_n(n_cases, weight = probability)
Y[data_sim_cases$obs_id] <- 1
```

The function `penalized.clr` fits a penalized conditional
logistic regression model with different penalties for different blocks
of covariates. The L1 penalty parameter `lambda` (a numeric
vector of length equal to the number of blocks) is specified by the
user.

This code illustrates how to fit `penalized.clr` with penalty
parameters specified by the user. Here `Y` is the response
vector, `X` is the multi-source matrix of covariates,
`stratum` is the vector of ids of the matching pairs, and
`p` is the vector of block dimensions; it has the same length
as the vector of penalty parameters `lambda`. Penalty parameters
are thus specified for each data source, i.e., for each block of covariates. It
is possible to standardize the variables by setting `standardize =
TRUE` (`TRUE` by default).

```
fit1 <- penalized.clr(response = Y, penalized = X, stratum = stratum,
                      lambda = c(1, 4), p = p, standardize = TRUE)
```

The function returns a list with elements:

- `penalized` - regression coefficients for the penalized covariates.
- `unpenalized` - regression coefficients for the unpenalized covariates.
- `converged` - whether the fitting process was judged to have converged.
- `lambda` - the L1 tuning parameter used.
- `alpha` - the elastic net mixing parameter used.
- `Y` - the response vector, saved as a survival object used internally.

For instance, `fit1$penalized` is a numeric vector
containing the regression coefficients for the penalized covariates. In
this simple example, we can compare estimated coefficients with the true
coefficients (the ones that generated our data) by constructing a
`2 x 2` contingency table cross-tabulating true non-zero and
estimated non-zero coefficients:

```
str(fit1)
#> List of 6
#> $ penalized : num [1:52] 0.251 0.212 0.584 0.402 0.613 ...
#> $ unpenalized: num(0)
#> $ converged : logi TRUE
#> $ lambda : num [1:2] 1 4
#> $ alpha : num 1
#> $ Y : 'Surv' num [1:200, 1:2] 1+ 1 1+ 1 1 1+ 1+ 1 1 1+ ...
#> ..- attr(*, "dimnames")=List of 2
#> .. ..$ : NULL
#> .. ..$ : chr [1:2] "time" "status"
#> ..- attr(*, "type")= chr "right"
nonzero_index <- (beta != 0) * 1 #index of nonzero coefficients
table(fitted = (fit1$penalized != 0) * 1, nonzero_index)
#> nonzero_index
#> fitted 0 1
#> 0 19 2
#> 1 19 12
```

The package `penalizedclr` is not limited to the L1, i.e. lasso,
penalty. While the L1 penalty is better suited for variable selection, in the
presence of highly correlated covariates it can be useful to add a small
amount of L2 penalty. The two can be combined in an elastic net
framework by specifying the mixing parameter `alpha`, which can
take values between 0 and 1. The default `alpha = 1` gives the
lasso.
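For instance, an elastic net fit could be obtained as follows (the value `alpha = 0.6` is purely illustrative; any value in (0, 1) mixes the two penalties):

```
fit2 <- penalized.clr(response = Y, penalized = X, stratum = stratum,
                      lambda = c(1, 4), p = p, alpha = 0.6,
                      standardize = TRUE)
```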

The function `stable.clr` performs stability selection
(Meinshausen and Bühlmann 2010; Shah and Samworth 2013) for variable
selection in penalized conditional logistic regression. For details on
stability selection, we refer to the original publications. Briefly, a
set of L1 penalties is considered, and for each considered value of the
penalty, \(2B\) random subsamples of
\(\lfloor K/2 \rfloor\) matched pairs
are taken and a penalized model is estimated. For each covariate and
penalty value, the selection probability is estimated as the proportion of
estimated models in which the associated coefficient estimate is
different from zero. Finally, the overall selection probability of a
variable is obtained by taking the maximum selection probability over
all considered penalty values.

Parameter \(B\) is set to 100 by default, but can be changed by the user. Note that this choice will have an impact on the computation time, and higher values of \(B\) will lead to a slower computation.

The function returns a list containing the selection probabilities for each covariate, i.e. the proportion of estimated models in which the associated coefficient estimate is different from zero. The user can then set a threshold for selection probability (values ranging from 0.55 to 0.9 are recommended) and obtain the set of selected covariates.

The following code performs stability selection when all covariates are considered as a single block (a single data source) and a sequence of L1 penalties contains 3 values.

```
stable1 <- stable.clr(response = Y, B = 50,
                      penalized = X, stratum = stratum,
                      lambda.seq = c(1, 4, 6), parallel = TRUE)
```

The function returns a list with two elements:

- `Pistab` - a numeric vector giving estimates of selection probabilities for each penalized covariate.
- `lambda.seq` - the sequence of L1 penalties used.

To inspect the results, we can, for instance, print the covariates with selection probability higher than 0.6: 2, 3, 4, 5, 13, 16, 17, 18, 39, 47.
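These indices can be obtained, for example, as:

```
which(stable1$Pistab > 0.6)
```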

We can inspect the histogram of selection probabilities.
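For instance, with base graphics:

```
hist(stable1$Pistab, breaks = seq(0, 1, by = 0.1),
     xlab = "Selection probability", main = "")
```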

We can also compute the variable selection classification accuracy:

```
(tab1 <- table(stable = (stable1$Pistab > 0.65) * 1, nonzero_index))
#> nonzero_index
#> stable 0 1
#> 0 37 7
#> 1 1 7
(tab1[1, 1] + tab1[2, 2]) / sum(tab1)
#> [1] 0.8461538
```

This histogram shows null variables in blue and signal variables in red.

```
# plotting selection probabilities for true non-zero (red) and zero (blue) coefficients
s_prob_nonzero <- cut(stable1$Pistab[nonzero_index == 1],
                      breaks = seq(0, 1, by = 0.1), ordered_result = TRUE)
s_prob_zero <- cut(stable1$Pistab[nonzero_index == 0],
                   breaks = seq(0, 1, by = 0.1), ordered_result = TRUE)
barplot(table(s_prob_nonzero), col = 2)
barplot(table(s_prob_zero), add = TRUE, col = 4)
```

This code implements stability selection while taking into account
the block structure of covariates (\(p\)). To achieve this, we use the function
`stable.clr.g`, which extends `stable.clr` to allow for
blocks (groups) of covariates. In this case, the function takes as an
argument `lambda.list`, a list of numeric vectors of L1
penalties. Each vector has length equal to the number of blocks,
with the \(i\)th element giving the
penalty to be applied to the \(i\)th
data source.

```
stable2 <- penalizedclr::stable.clr.g(response = Y, p = p,
                                      standardize = TRUE,
                                      penalized = X, stratum = stratum,
                                      lambda.list = list(c(4, 1), c(1, 4)))
```

The selection probability histogram can be inspected in the same way in this case.

It is possible to obtain a data adaptive sequence of L1 penalty
parameters via the function `default.pf`, which searches for a
suitable vector of penalty factors (up to a multiplicative constant),
and the function `find.default.lambda`, which searches for the optimal L1
penalty for a given vector of penalty factors.

The first function follows the procedure proposed by Gerhard Schulze
in his Master's thesis (Clinical Outcome Prediction based on Multi-Omics
Data: Extension of IPF-LASSO). In short, in the first step, a tentative
conditional logistic regression model is fit, either to each covariate
block separately (`type.step1 = "sep"`) or to all blocks
jointly (`type.step1 = "comb"`). The penalty factor for each
block is then obtained as the inverse of the mean of the absolute values
of the estimated coefficients of that block. In this way, blocks that
contain covariates with large estimated coefficients are penalized
less. If all estimated coefficients pertaining to a block are zero, the
function returns a message. The function returns a list with penalty
factors corresponding to the different blocks.

The second function relies on the `cv.clogitL1` function
of the `clogitL1` package to perform cross-validation and,
for a given vector of penalty factors (for instance, obtained from
`default.pf`), returns the value of `lambda` that
achieves the minimal cross-validated deviance; see the documentation of the
`clogitL1` package for more details.

```
(pf <- default.pf(response = Y, stratum = stratum, penalized = X,
                  nfolds = 10, alpha = 0.3,
                  standardize = TRUE, p = p, type.step1 = "comb"))
#> $pf
#> [1] 32.70923 42.73989
(lambda <- find.default.lambda(response = Y, penalized = X,
                               standardize = TRUE, stratum = stratum,
                               p = p, pf.list = pf))
#> [[1]]
#> [1] 0.1375511
```

```
stable3 <- stable.clr.g(response = Y, p = p,
                        standardize = TRUE,
                        penalized = X, stratum = stratum,
                        lambda.list = list(pf$pf * as.numeric(lambda)))
```

The histogram of selection probabilities with true signal variables in red and null variables in blue.
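It can be drawn with base graphics, as for `stable1` (this assumes the `nonzero_index` vector computed earlier):

```
# selection probabilities for true non-zero (red) and zero (blue) coefficients
s3_prob_nonzero <- cut(stable3$Pistab[nonzero_index == 1],
                       breaks = seq(0, 1, by = 0.1), ordered_result = TRUE)
s3_prob_zero <- cut(stable3$Pistab[nonzero_index == 0],
                    breaks = seq(0, 1, by = 0.1), ordered_result = TRUE)
barplot(table(s3_prob_nonzero), col = 2)           # signal variables in red
barplot(table(s3_prob_zero), add = TRUE, col = 4)  # null variables in blue
```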

The user can augment the data adaptive vector of penalty factors with other vectors.

```
alt.pf.list <- list(c(1, 4), c(2, 4), c(4, 1), c(4, 2), pf$pf)
alt.lambda <- find.default.lambda(response = Y, penalized = X,
                                  standardize = TRUE, stratum = stratum,
                                  p = p, pf.list = alt.pf.list)
lambda.matrix <- mapply(function(x, y) x * y, alt.lambda, alt.pf.list)
lambda.list <- lapply(seq_len(ncol(lambda.matrix)),
                      function(i) lambda.matrix[, i])
```

```
stable4 <- stable.clr.g(response = Y, p = p,
                        standardize = TRUE,
                        penalized = X, stratum = stratum,
                        lambda.list = lambda.list)
```

The histogram of selection probabilities can be inspected as before.

Boulesteix, A. L., De Bin, R., Jiang, X., & Fuchs, M. (2017). IPF-LASSO: Integrative L1-penalized regression with penalty factors for prediction based on multi-omics data. Computational and Mathematical Methods in Medicine, 2017.

Meinshausen, N., & Bühlmann, P. (2010). Stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(4), 417-473.

Shah, R. D., & Samworth, R. J. (2013). Variable selection with error control: another look at stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 75(1), 55-80.