SGS reproducible example

Fabio Feser

2023-08-21

Introduction

Sparse-group SLOPE (SGS) is a penalised regression approach that performs bi-level selection with FDR control under orthogonal designs. SGS is described in detail in Feser and Evangelou (2023).

The method is implemented in the sgs R package. The package has implementations for Gaussian and Binomial responses, both of which are demonstrated here.

Gaussian response

Data

For this example, a \(400\times 500\) input matrix is used with a simple grouping structure, sampled from a multivariate Gaussian distribution with no correlation.

library(sgs)
groups = c(rep(1:20, each=3),
           rep(21:40, each=4),
           rep(41:60, each=5),
           rep(61:80, each=6),
           rep(81:100, each=7))

data = generate_toy_data(p=500, n=400, groups = groups, seed_id=3)

Fitting an SGS model

We now fit an SGS model to the data using linear regression. The SGS model has many different hyperparameters which can be tuned/selected. Of particular importance is the \(\lambda\) parameter, which defines the level of sparsity in the model. First, we select it manually; afterwards, we use cross-validation to tune it. The other parameters we leave at their default values, although they can easily be changed.

model = fit_sgs(X = data$X, y = data$y, groups = groups, type="linear", lambda = 1, alpha=0.95, vFDR=0.1, gFDR=0.1, standardise = "l2", intercept = TRUE, verbose=FALSE)

Note: we have fit an intercept and applied \(\ell_2\) standardisation. This is the recommended usage when applying SGS.

Output of model

The package provides several useful outputs after fitting a model. The beta vector contains the fitted coefficients (note that the first entry is the intercept). We can also recover the indices of the non-zero variables and groups, which are indexed from the first variable, not the intercept.

model$beta[model$selected_var+1,]
##       v100       v136       v234       v334 
##  1.2211478  3.0393493  2.4211141 -0.4108451
model$group.effects[model$selected_group,]
##       G30       G39       G59       G76 
## 1.2211478 3.0393493 2.4211141 0.4108451
model$selected_var
## [1] 100 136 234 334
model$selected_group
## [1] 30 39 59 76
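
Since selection is bi-level, each selected variable should belong to a selected group. As a quick check (a base-R sketch using only the groups vector defined earlier, no package functions), we can map the selected variable indices back to their group labels:

groups[model$selected_var]

This returns 30, 39, 59 and 76 here, matching model$selected_group.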

Defining a function that lets us calculate various metrics (including the FDR and sensitivity):

fdr_sensitivity = function(fitted_ids, true_ids, num_coef){
  # calculates the FDR, sensitivity, FPR, and F1 score for a set of selected indices
  num_true = length(intersect(fitted_ids, true_ids))
  num_false = length(fitted_ids) - num_true
  num_missed = length(true_ids) - num_true
  num_true_negatives = num_coef - length(true_ids)
  out = list()
  out$fdr = num_false / (num_true + num_false)
  if (is.nan(out$fdr)){out$fdr = 0}
  out$sensitivity = num_true / length(true_ids)
  if (length(true_ids) == 0){
    out$sensitivity = 1
  }
  out$fpr = num_false / num_true_negatives
  out$f1 = (2*num_true)/(2*num_true + num_false + num_missed)
  if (is.nan(out$f1)){out$f1 = 1}
  return(out)
}

Calculating the relevant metrics gives

fdr_sensitivity(fitted_ids = model$selected_var, true_ids = data$true_var_id, num_coef = 500)
## $fdr
## [1] 0
## 
## $sensitivity
## [1] 0.1428571
## 
## $fpr
## [1] 0
## 
## $f1
## [1] 0.25
fdr_sensitivity(fitted_ids = model$selected_group, true_ids = data$true_grp_id, num_coef = 100)
## $fdr
## [1] 0
## 
## $sensitivity
## [1] 0.4
## 
## $fpr
## [1] 0
## 
## $f1
## [1] 0.5714286

The model is currently too sparse, as our choice of \(\lambda\) is too high. We can instead use cross-validation.

Cross validation

Cross-validation is used to fit SGS models along a \(\lambda\) path of length \(20\). The first value, \(\lambda_\text{max}\), is chosen to give the null model and the path is terminated at \(\lambda_\text{min} = \text{min\_frac} \cdot \lambda_\text{max}\). The 1se rule (as in the glmnet package) is used to choose the optimal model.
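
For intuition, a path of this form can be sketched in a few lines of base R. This is only an illustration of the general construction and not the package's internal code; lambda_max below is simply the first \(\lambda\) value reported by fit_sgs_cv in the output that follows:

# illustrative log-spaced lambda path (assumed construction, not package internals)
lambda_max = 1.8388
min_frac = 0.05
nlambda = 20
lambda_path = exp(seq(log(lambda_max), log(min_frac * lambda_max), length.out = nlambda))

The sequence decreases geometrically from lambda_max to min_frac * lambda_max and agrees with the \(\lambda\) values printed below.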

cv_model = fit_sgs_cv(X = data$X, y = data$y, groups=groups, type = "linear", nlambda = 20, nfolds=10, alpha = 0.95, vFDR = 0.1, gFDR = 0.1, min_frac = 0.05, standardise="l2",intercept=TRUE,verbose=TRUE)
## [1] "Lambda 1/20 done. Lambda: 1.8388. Number of non-zero: 0. Error: 9522.47623165794. Avg iter: 7"
## [1] "Lambda 2/20 done. Lambda: 1.5706. Number of non-zero: 1. Error: 9121.87427456379. Avg iter: 26"
## [1] "Lambda 3/20 done. Lambda: 1.3415. Number of non-zero: 2. Error: 8348.10001181521. Avg iter: 31"
## [1] "Lambda 4/20 done. Lambda: 1.1458. Number of non-zero: 3. Error: 7525.33322733818. Avg iter: 35"
## [1] "Lambda 5/20 done. Lambda: 0.9787. Number of non-zero: 8. Error: 6706.42167334536. Avg iter: 35"
## [1] "Lambda 6/20 done. Lambda: 0.8359. Number of non-zero: 14. Error: 5509.29608346792. Avg iter: 44"
## [1] "Lambda 7/20 done. Lambda: 0.714. Number of non-zero: 15. Error: 4256.09986924448. Avg iter: 37"
## [1] "Lambda 8/20 done. Lambda: 0.6098. Number of non-zero: 15. Error: 3281.14251906525. Avg iter: 37"
## [1] "Lambda 9/20 done. Lambda: 0.5209. Number of non-zero: 15. Error: 2559.14833615298. Avg iter: 37"
## [1] "Lambda 10/20 done. Lambda: 0.4449. Number of non-zero: 19. Error: 2011.70593862728. Avg iter: 36"
## [1] "Lambda 11/20 done. Lambda: 0.38. Number of non-zero: 22. Error: 1552.91875758507. Avg iter: 39"
## [1] "Lambda 12/20 done. Lambda: 0.3246. Number of non-zero: 22. Error: 1179.20560203183. Avg iter: 33"
## [1] "Lambda 13/20 done. Lambda: 0.2772. Number of non-zero: 22. Error: 892.913519340801. Avg iter: 32"
## [1] "Lambda 14/20 done. Lambda: 0.2368. Number of non-zero: 22. Error: 682.816068341828. Avg iter: 34"
## [1] "Lambda 15/20 done. Lambda: 0.2022. Number of non-zero: 24. Error: 525.918317926705. Avg iter: 36"
## [1] "Lambda 16/20 done. Lambda: 0.1727. Number of non-zero: 24. Error: 405.623243053464. Avg iter: 37"
## [1] "Lambda 17/20 done. Lambda: 0.1475. Number of non-zero: 26. Error: 314.722794039507. Avg iter: 39"
## [1] "Lambda 18/20 done. Lambda: 0.126. Number of non-zero: 26. Error: 244.495442693249. Avg iter: 39"
## [1] "Lambda 19/20 done. Lambda: 0.1076. Number of non-zero: 27. Error: 192.736115731821. Avg iter: 40"
## [1] "Lambda 20/20 done. Lambda: 0.0919. Number of non-zero: 27. Error: 154.39418403602. Avg iter: 41"

The verbose output contains useful information, showing the error for each \(\lambda\) value, as well as the number of non-zero parameters. Aside from the verbose output, we can see a more succinct summary by using the print function

print(cv_model)
## 
##  regression type:  linear 
## 
##          lambda     error estimated_non_zero
##  [1,] 1.8387781 9522.4762                  0
##  [2,] 1.5705583 9121.8743                  1
##  [3,] 1.3414633 8348.1000                  2
##  [4,] 1.1457860 7525.3332                  3
##  [5,] 0.9786519 6706.4217                  8
##  [6,] 0.8358975 5509.2961                 14
##  [7,] 0.7139663 4256.0999                 15
##  [8,] 0.6098211 3281.1425                 15
##  [9,] 0.5208674 2559.1483                 15
## [10,] 0.4448893 2011.7059                 19
## [11,] 0.3799940 1552.9188                 22
## [12,] 0.3245648 1179.2056                 22
## [13,] 0.2772210  892.9135                 22
## [14,] 0.2367832  682.8161                 22
## [15,] 0.2022440  525.9183                 24
## [16,] 0.1727430  405.6232                 24
## [17,] 0.1475452  314.7228                 26
## [18,] 0.1260230  244.4954                 26
## [19,] 0.1076402  192.7361                 27
## [20,] 0.0919389  154.3942                 27

The best model is found to be the one at the end of the path. Checking the metrics again, we see that CV has produced a model with the right level of sparsity, giving FDR values below the specified targets.

fdr_sensitivity(fitted_ids = cv_model$fit$selected_var, true_ids = data$true_var_id, num_coef = 500)
## $fdr
## [1] 0.03703704
## 
## $sensitivity
## [1] 0.9285714
## 
## $fpr
## [1] 0.002118644
## 
## $f1
## [1] 0.9454545
fdr_sensitivity(fitted_ids = cv_model$fit$selected_group, true_ids = data$true_grp_id, num_coef = 100)
## $fdr
## [1] 0.09090909
## 
## $sensitivity
## [1] 1
## 
## $fpr
## [1] 0.01111111
## 
## $f1
## [1] 0.952381

Plot

We can visualise the solution using the plot function

plot(cv_model,how_many = 10)

Prediction

The package implements a predict function, allowing for easy prediction

predict(model,data$X,type="linear")[1:5]
## [1]  2.79322803  0.03135326  5.00188815  4.00179438 -2.74987113
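
As a quick sanity check of the fit, we can also compare the in-sample predictions against the observed response (a minimal base-R sketch using the same predict call as above):

# in-sample mean squared error of the fitted linear SGS model
mean((predict(model, data$X, type = "linear") - data$y)^2)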

Logistic regression

As mentioned, the package can also be used to fit SGS to a Binomial response. First, we generate some Binomial data. We can use the same input matrix, \(X\), and true \(\beta\) as before.

sigmoid = function(x) {
   1 / (1 + exp(-x))
}
y = ifelse(sigmoid(data$X %*% data$true_beta + rnorm(400))>0.5,1,0)
train_y = y[1:350] 
test_y = y[351:400]
train_X = data$X[1:350,] 
test_X = data$X[351:400,]
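
Before fitting, it is worth a quick check that the generated classes are reasonably balanced across the train/test split (plain base R, nothing package-specific):

table(train_y)
table(test_y)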

Fitting and prediction

We can again apply CV.

cv_model = fit_sgs_cv(X = train_X, y = train_y, groups=groups, type = "logistic", nlambda = 20, nfolds=10, alpha = 0.95, vFDR = 0.1, gFDR = 0.1, min_frac = 0.05, standardise="l2",intercept=FALSE,verbose=TRUE)
## [1] "Lambda 1/20 done. Lambda: 0.0486. Number of non-zero: 0. Misclassification error: 0.437142857142857. Avg iter: 2"
## [1] "Lambda 2/20 done. Lambda: 0.0415. Number of non-zero: 1. Misclassification error: 0.351428571428571. Avg iter: 4"
## [1] "Lambda 3/20 done. Lambda: 0.0355. Number of non-zero: 2. Misclassification error: 0.331428571428571. Avg iter: 5"
## [1] "Lambda 4/20 done. Lambda: 0.0303. Number of non-zero: 2. Misclassification error: 0.314285714285714. Avg iter: 7"
## [1] "Lambda 5/20 done. Lambda: 0.0259. Number of non-zero: 9. Misclassification error: 0.28. Avg iter: 11"
## [1] "Lambda 6/20 done. Lambda: 0.0221. Number of non-zero: 11. Misclassification error: 0.222857142857143. Avg iter: 9"
## [1] "Lambda 7/20 done. Lambda: 0.0189. Number of non-zero: 13. Misclassification error: 0.208571428571429. Avg iter: 11"
## [1] "Lambda 8/20 done. Lambda: 0.0161. Number of non-zero: 23. Misclassification error: 0.177142857142857. Avg iter: 15"
## [1] "Lambda 9/20 done. Lambda: 0.0138. Number of non-zero: 34. Misclassification error: 0.14. Avg iter: 19"
## [1] "Lambda 10/20 done. Lambda: 0.0118. Number of non-zero: 43. Misclassification error: 0.134285714285714. Avg iter: 20"
## [1] "Lambda 11/20 done. Lambda: 0.0101. Number of non-zero: 57. Misclassification error: 0.131428571428571. Avg iter: 21"
## [1] "Lambda 12/20 done. Lambda: 0.0086. Number of non-zero: 80. Misclassification error: 0.142857142857143. Avg iter: 21"
## [1] "Lambda 13/20 done. Lambda: 0.0073. Number of non-zero: 102. Misclassification error: 0.14. Avg iter: 22"
## [1] "Lambda 14/20 done. Lambda: 0.0063. Number of non-zero: 113. Misclassification error: 0.148571428571429. Avg iter: 26"
## [1] "Lambda 15/20 done. Lambda: 0.0054. Number of non-zero: 120. Misclassification error: 0.145714285714286. Avg iter: 30"
## [1] "Lambda 16/20 done. Lambda: 0.0046. Number of non-zero: 124. Misclassification error: 0.145714285714286. Avg iter: 36"
## [1] "Lambda 17/20 done. Lambda: 0.0039. Number of non-zero: 132. Misclassification error: 0.145714285714286. Avg iter: 45"
## [1] "Lambda 18/20 done. Lambda: 0.0033. Number of non-zero: 136. Misclassification error: 0.137142857142857. Avg iter: 55"
## [1] "Lambda 19/20 done. Lambda: 0.0028. Number of non-zero: 154. Misclassification error: 0.148571428571429. Avg iter: 67"
## [1] "Lambda 20/20 done. Lambda: 0.0024. Number of non-zero: 164. Misclassification error: 0.148571428571429. Avg iter: 81"

and again, use the predict function

predictions = predict(cv_model$fit,test_X,type="logistic")

In the Binomial case, the function returns both the predicted class probabilities (response) and the predicted class (class). We can use this to check the prediction accuracy, given as \(82\%\).

predictions$response[1:5]
## [1] 0.4085523 0.6480921 0.1706632 0.1489437 0.3894489
predictions$class[1:5]
## [1] 0 1 0 0 0
sum(predictions$class == test_y)/length(test_y)
## [1] 0.82
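
Beyond the overall accuracy, a confusion matrix gives a fuller picture of where the misclassifications occur (a simple base-R tabulation; the exact counts depend on the noise added when generating y):

table(predicted = predictions$class, actual = test_y)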

Reference

Feser, F. and Evangelou, M. (2023). Sparse-group SLOPE: adaptive bi-level selection with FDR-control. arXiv preprint.