Type: | Package |
Title: | Variable Selection in High-Dimensional Logistic Regression Models using a Whitening Approach |
Version: | 2.1 |
Date: | 2023-07-17 |
Author: | Wencan Zhu |
Maintainer: | Wencan Zhu <wencan.zhu@yahoo.com> |
Description: | It proposes a novel variable selection approach in classification problem that takes into account the correlations that may exist between the predictors of the design matrix in a high-dimensional logistic model. Our approach consists in rewriting the initial high-dimensional logistic model to remove the correlation between the predictors and in applying the generalized Lasso criterion. |
License: | GPL-2 |
Imports: | cvCovEst, genlasso, tibble, MASS, ggplot2, Matrix, glmnet, corpcor |
VignetteBuilder: | knitr |
Suggests: | knitr |
Depends: | R (≥ 3.5.0) |
NeedsCompilation: | no |
Packaged: | 2023-07-17 06:44:04 UTC; chenxi |
Repository: | CRAN |
Date/Publication: | 2023-07-17 07:10:06 UTC |
Variable Selection in High-Dimensional Logistic Regression Models using a Whitening Approach
Description
It proposes a novel variable selection approach in classification problem that takes into account the correlations that may exist between the predictors of the design matrix in a high-dimensional logistic model. Our approach consists in rewriting the initial high-dimensional logistic model to remove the correlation between the predictors and in applying the generalized Lasso criterion.
Details
The DESCRIPTION file:
Package: | WLogit |
Type: | Package |
Title: | Variable Selection in High-Dimensional Logistic Regression Models using a Whitening Approach |
Version: | 2.1 |
Date: | 2023-07-17 |
Author: | Wencan Zhu |
Maintainer: | Wencan Zhu <wencan.zhu@yahoo.com> |
Description: | It proposes a novel variable selection approach in classification problem that takes into account the correlations that may exist between the predictors of the design matrix in a high-dimensional logistic model. Our approach consists in rewriting the initial high-dimensional logistic model to remove the correlation between the predictors and in applying the generalized Lasso criterion. |
License: | GPL-2 |
Imports: | cvCovEst, genlasso, tibble, MASS, ggplot2, Matrix, glmnet, corpcor |
VignetteBuilder: | knitr |
Suggests: | knitr |
Depends: | R (>= 3.5.0) |
NeedsCompilation: | no |
Packaged: | 2023-07-17 07:06:43 UTC; mmip |
Index of help topics:
CalculPx Calculate the class-conditional probabilities. CalculWeight Calculate the weight Refit_glm Refit the logistic regression with chosen variables Thresholding Thresholding on a vector WLogit-package Variable Selection in High-Dimensional Logistic Regression Models using a Whitening Approach WhiteningLogit Variable selection in high-dimensional logistic regression models using a whitening approach WorkingResp Calculate the working response X Example of a design matrix of a logistic model beta True coefficients in the esample. test WLogit output top Thresholding to zero of the smallest values top_thresh Thresholding to a given threshold of the smallest values y Example of a binary response variable of a logistic model.
Further information is available in the following vignettes:
Vignettes | WLogit package (source, pdf) |
This package consists of functions: "WhiteningLogit", "CalculPx", "CalculWeight", "Refit_glm", "top", "top_thresh", "WorkingResp", and "Thresholding". For further information on how to use these functions, we refer the reader to the vignette of the package.
Author(s)
Wencan Zhu
Maintainer: Wencan Zhu <wencan.zhu@yahoo.com>
References
W. Zhu, C. Levy-Leduc, N. Ternes. "Variable selection in high-dimensional logistic regression models using a whitening approach". (2022)
Calculate the class-conditional probabilities.
Description
Calculate the probability for a repsonse to be 1 in the logistic regression model.
Usage
CalculPx(X, beta, intercept = 0)
Arguments
X |
Design matrix of the logistic model considered. |
beta |
Vector of coefficients of the logistic model considered. |
intercept |
Whether there is the intercept |
Value
prob |
the probability for a repsonse to be 1 |
Author(s)
Wencan Zhu, Celine Levy-Leduc, Nils Ternes
See Also
Please read https://hastie.su.domains/Papers/glmnet.pdf for more details
Examples
data(X)
data(beta)
CalculPx(X=X, beta=beta)
##---- Should be DIRECTLY executable !! ----
##-- ==> Define data, use random,
##-- or do help(data=index) for the standard data sets.
## The function is currently defined as
function (X, beta, intercept = 0)
{
prob <- 1/(1 + exp(-(X %*% beta + intercept)))
return(prob)
}
Calculate the weight
Description
Calculate the weight in the penalized weighted- least-squares problem
Usage
CalculWeight(Px)
Arguments
Px |
The vector of estimated probability for each response to be 1. |
Author(s)
Wencan Zhu, Celine Levy-Leduc, Nils Ternes
See Also
Please read https://hastie.su.domains/Papers/glmnet.pdf for more details
Examples
data(X)
data(beta)
px <- CalculPx(X=X, beta=beta)
CalculWeight(px)
##---- Should be DIRECTLY executable !! ----
##-- ==> Define data, use random,
##-- or do help(data=index) for the standard data sets.
## The function is currently defined as
function (Px)
{
return(Px * (1 - Px))
}
Refit the logistic regression with chosen variables
Description
Refit the logistic regression with chosen variables.
Usage
Refit_glm(X, beta_pred, y)
Arguments
X |
Design matrix of the logistic model considered. |
beta_pred |
Predicted coefficients to be refited. |
y |
Binary response |
Value
beta_refit |
The new estimated coefficients |
Author(s)
Wencan Zhu, Celine Levy-Leduc, Nils Ternes
Examples
data(X)
data(y)
data(beta)
Refit_glm(X=X, beta_pred=beta, y=y)
##---- Should be DIRECTLY executable !! ----
##-- ==> Define data, use random,
##-- or do help(data=index) for the standard data sets.
## The function is currently defined as
function (X, beta_pred, y)
{
X_temp <- X[, which(beta_pred != 0)]
if (length(which(beta_pred != 0)) == 0) {
coef_est <- beta_pred
}
else if (is.null(ncol(X_temp))) {
mydata <- data.frame(Y = y, X_temp)
colnames(mydata) <- c("Y", "X")
formula <- paste0("Y~-1 +", paste0(colnames(mydata)[-which(colnames(mydata) ==
"Y")], collapse = " + "))
myform <- as.formula(formula)
mod_lm <- glm(myform, data = mydata, family = "binomial")
coef_est <- mod_lm$coefficients
}
else {
mydata <- data.frame(Y = y, as.matrix(X_temp))
formula <- paste0("Y~-1 +", paste0(colnames(mydata)[-which(colnames(mydata) ==
"Y")], collapse = " + "))
myform <- as.formula(formula)
if (length(which(beta_pred != 0)) >= length(y)) {
mod_ridge <- cv.glmnet(x = as.matrix(X_temp), y = y,
alpha = 0, intercept = FALSE, family = "binomial")
opt_lambda <- mod_ridge$lambda[which.min(mod_ridge$cvm)]
coef_est <- as.vector(glmnet(x = as.matrix(X), y = y,
alpha = 0, intercept = FALSE, family = "binomial",
lambda = opt_lambda)$beta)
}
else {
mod_lm <- glm(myform, data = mydata, family = "binomial")
coef_est <- mod_lm$coefficients
}
}
beta_refit <- rep(0, length(beta_pred))
beta_refit[which(beta_pred != 0)] <- coef_est
return(beta_refit)
}
Thresholding on a vector
Description
This function provides the thresholding (correction) given a vector. It calls the function top
or top_thresh
in the same package, and the output is the vector after correction with the optimal threshold parameter.
Usage
Thresholding(X, y, coef, TOP)
Arguments
X |
Design matrix of the logistic model considered. |
y |
Binary response |
coef |
Candidate vector to be corrected |
TOP |
The grill of thresholding |
Value
opt_top |
The optimal threshold |
auc |
the log-likelihood for each grill of thresholding |
Author(s)
Wencan Zhu, Celine Levy-Leduc, Nils Ternes
Variable selection in high-dimensional logistic regression models using a whitening approach
Description
Variable selection in high-dimensional logistic regression models using a whitening approach
Usage
WhiteningLogit(X = X, y = y, nlambda = 50, maxit = 100, gamma = 0.9999,
top_grill=c(1:100))
Arguments
X |
Design matrix of the logistic model considered. |
y |
Binary response of the logistic model considered. |
nlambda |
Number of lambda |
maxit |
Integer specifying the maximum number of steps for the generalized Lasso algorithm. It should not be smaller than nlambda. |
gamma |
Parameter |
top_grill |
A grill of provided for the thresholding |
Value
Returns a list with the following components
lambda |
different values of the parameter |
beta |
matrix of the estimations of |
beta.min |
estimation of |
log.likelihood |
Log-likelihood for all the |
Author(s)
Wencan Zhu, Celine Levy-Leduc, Nils Ternes
References
W. Zhu, C. Levy-Leduc, N. Ternes. "Variable selection in high-dimensional logistic regression models using a whitening approach". (2022)
Examples
X0 <- matrix( rnorm(50*10,mean=0,sd=1), 50, 10)
y0 <- c(rep(1,25), rep(0,25))
mod <- WhiteningLogit(X=X0, y=y0)
plot(mod$beta.min)
##---- Should be DIRECTLY executable !! ----
##-- ==> Define data, use random,
##-- or do help(data=index) for the standard data sets.
## The function is currently defined as
function(X=X, y=y,
nlambda=50,
maxit=100,
gamma=0.9999,
top_grill=c(1:100)){
p=ncol(X)
n=nrow(X)
mod_ridge <- cv.glmnet(x=as.matrix(X), y=y, alpha=0.5, intercept=FALSE, family="binomial")
pr_est <- predict(mod_ridge, as.matrix(X), s = "lambda.min", type="response")
beta_ini <- predict(mod_ridge, as.matrix(X), s = "lambda.min", type="coefficients")[-1]
diag_w <- pr_est*(1-pr_est)
square_root_w <- diag(sqrt(as.vector(diag_w)), nrow=n)
X_new <- square_root_w
Cov_est <- cvCovEst(
dat = X_new,
estimators = c(
linearShrinkLWEst, thresholdingEst, sampleCovEst
),
estimator_params = list(
thresholdingEst = list(gamma = seq(0.1, 0.3, 0.1))
),
center = TRUE,
scale = TRUE
)
Sigma_est <- Cov_est$estimate
SVD_new <- fast.svd(Sigma_est)
U_sigma_new <- SVD_new$u
D_sigma_new <- SVD_new$d
inv_transmat <- U_sigma_new
inv_diag_new <- ifelse(D_sigma_new<0.000001, 0, 1/sqrt(D_sigma_new))
trans_mat <- U_sigma_new
if (p <= 50) {
top_grill <- seq(1, p, 2)
}else if (p <= 200) {
top_grill <- c(1:50, seq(52, p, 2))
}else if (p <= 300) {
top_grill <- c(1:50, seq(52, 100, 2), seq(105, 200, 5),
seq(210, p, 10))
}else {
top_grill <- c(1:50, seq(52, 100, 2), seq(105, 200, 5),
seq(210, 300, 10))
}
X_tilde <- X
beta_tilde_ini <- inv_transmat
Px <- CalculPx(X_tilde, beta=beta_tilde_ini)
wt <- CalculWeight(Px)
# wt <- ifelse(wt0==0, 0.0001, wt0)
ystar <- WorkingResp(y=y, Px=Px, X=X_tilde, beta=beta_tilde_ini)
X_tilde_weighted <- sweep(X, MARGIN=1, sqrt(wt), `*`)
ystar_weighted <- sqrt(wt)*ystar
gen.model0 <- genlasso(y=ystar_weighted, X=X_tilde_weighted,
D=trans_mat, maxsteps = 50)
parameter_tmp <- beta_tilde_ini
beta_final <- matrix(NA, length(gen.model0$lambda), p)
skip_i <- TRUE
eval_final <- c()
defaultW <- getOption("warn")
options(warn = -1)
for(i in 1:length(gen.model0$lambda)){
#inner loop
epsilon=10
j=0
if(skip_i){parameter_tmp <- beta_tilde_ini
} else {parameter_tmp <- parameter_current}
skip_i <-FALSE
while(epsilon > 0.001){
j=j+1
parameter_current <- parameter_tmp
Px <- CalculPx(X_tilde, beta=parameter_current)
wt0 <- CalculWeight(Px)
wt <- ifelse(round(wt0,4)==0, 0.0001, wt)
ystar <- WorkingResp(y=y, Px=Px, X=X_tilde, beta=parameter_current)
X_tilde_weighted <- sweep(X, MARGIN=1, sqrt(wt), `*`)
ystar_weighted <- sqrt(wt)*ystar
gen.model <- genlasso(y=ystar_weighted, X=X_tilde_weighted, D=trans_mat, maxsteps = maxit)
if(gen.model0$lambda[i] < min(gen.model$lambda)){
parameter_tmp <- parameter_current
break
} else {
parameter_tmp <- coef(gen.model, lambda=gen.model0$lambda[i],
type = "primal")$beta
beta_current <- parameter_tmp
if(sum(is.na(parameter_tmp))>0){
skip_i <-TRUE
parameter_tmp <- rep(0,p)
break}
epsilon <- max(abs(parameter_current-parameter_tmp))
if(epsilon >=100){
skip_i <-TRUE
break}
if (j==maxit){
skip_i <-TRUE
break}
}
}
if(skip_i){
beta_final[i, ] <- rep(NA, p)
eval_final[i] <- NA
} else{
correction <- Thresholding(X_tilde, y, coef=parameter_tmp, TOP=top_grill)
opt_top_tilde <- correction$opt_top
beta_tilde_opt <- top_thresh(vect=parameter_tmp, thresh = opt_top_tilde)
beta_final0 <- trans_mat
correction <- Thresholding(X, y, coef=beta_final0, TOP=top_grill)
opt_top_final <- correction$opt_top
beta_final[i, ] <- beta_opt_final <- top(vect=beta_final0, thresh = opt_top_final)
beta_refit <- Refit_glm(X=X, beta_pred = beta_opt_final, y=y)
pr_est <- CalculPx(X, beta_refit)
ll <- pr_est^y*(1-pr_est)^(1-y)
#ll <- ifelse(ll<0.000001, 1, ll)
eval_final[i] <- -log(prod(ll))
}
beta.min <- beta_final[which.min(eval_final), ]
}
options(warn = defaultW)
return(list(beta=beta_final, lambda=gen.model0$lambda, beta.min=beta.min,
log.likelihood=eval_final))
}
Calculate the working response
Description
Calculate the working response in the iterative least square regression
Usage
WorkingResp(y, Px, X, beta, intercept = 0)
Arguments
X |
Design matrix of the logistic model considered. |
y |
Binary response of the logistic model considered. |
Px |
The probability of the reponse to be 1 |
beta |
Vector of coefficients |
intercept |
If there is an intercept |
Value
This function returns the vector of working response.
Author(s)
Wencan Zhu, Celine Levy-Leduc, Nils Ternes
See Also
Please read https://hastie.su.domains/Papers/glmnet.pdf for more details
Example of a design matrix of a logistic model
Description
It contains an example of a design matrix of a logistic model.
Usage
data("X")
Format
The format is: num [1:100, 1:500] -1.576 -0.476 -0.237 -0.398 0.284 ...
Examples
data(X)
True coefficients in the esample.
Description
True coefficients in the esample given in the vignette.
Usage
data("beta")
Format
The format is: num [1:500] 1 1 1 1 1 1 1 1 1 1 ...
Examples
data(beta)
plot(beta)
WLogit output
Description
The output of WLogit in the example given in the vignette.
Usage
data("test")
Format
The format is: List of 4 $ beta : num [1:50, 1:500] 0 0 0 0 0 ... $ lambda : num [1:50] 100.8 80 73 58.9 56.7 ... $ beta.min : num [1:500] 0.0194 0.0348 0.0259 0.0287 0.0385 ... $ log.likelihood: num [1:50] 57.7 57.7 57.7 57.7 57.7 ...
Examples
data(test)
str(test)
Thresholding to zero of the smallest values
Description
This function keeps only the K largest values of the vector sorted_vect
and sets the others to zero.
Usage
top(vect, thresh)
Arguments
vect |
vector to threshold |
thresh |
threshold |
Value
This function returns the thresholded vector.
Author(s)
Wencan Zhu, Celine Levy-Leduc, Nils Ternes
Examples
x=sample(1:10,10)
thresh=3
top(x,thresh)
##---- Should be DIRECTLY executable !! ----
##-- ==> Define data, use random,
##-- or do help(data=index) for the standard data sets.
## The function is currently defined as
function (vect, thresh)
{
sorted_vect <- sort(abs(vect), decreasing = TRUE)
v = sorted_vect[thresh]
ifelse(abs(vect) >= v, vect, 0)
}
Thresholding to a given threshold of the smallest values
Description
This function keeps only the K largest values of the vector vect
and sets the others to the smallest value among the K largest.
Usage
top_thresh(vect, thresh)
Arguments
vect |
vector to threshold |
thresh |
threshold |
Value
This function returns the thresholded vector.
Author(s)
Wencan Zhu, Celine Levy-Leduc, Nils Ternes
Examples
x=sample(1:10,10)
sorted_vect=sort(x,decreasing=TRUE)
thresh=3
top_thresh(x,thresh)
##---- Should be DIRECTLY executable !! ----
##-- ==> Define data, use random,
##-- or do help(data=index) for the standard data sets.
## The function is currently defined as
function (vect, thresh)
{
sorted_vect <- sort(vect, decreasing = TRUE)
v = sorted_vect[thresh]
ifelse(vect >= v, vect, v)
}
Example of a binary response variable of a logistic model.
Description
It contains an example of a binary response variable of a logistic model.
Usage
data("y")
Format
The format is: int [1:100] 0 1 0 1 1 0 0 0 1 1 ...
Examples
data(y)