SGS reproducible example

Fabio Feser

2023-08-21

Introduction

Sparse-group SLOPE (SGS) is a penalised regression approach that performs bi-level selection with FDR control under orthogonal designs. SGS is described in detail in Feser and Evangelou (2023).

The method is implemented in the sgs R package. The package has implementations for Gaussian and Binomial responses, both of which are demonstrated here.

Gaussian response

Data

For this example, a \(400\times 500\) input matrix is used with a simple grouping structure, sampled from a multivariate Gaussian distribution with no correlation.

library(sgs)
groups = c(rep(1:20, each=3),
           rep(21:40, each=4),
           rep(41:60, each=5),
           rep(61:80, each=6),
           rep(81:100, each=7))

data = generate_toy_data(p=500, n=400, groups = groups, seed_id=3)

Fitting an SGS model

We now fit an SGS model to the data using linear regression. The SGS model has many different hyperparameters which can be tuned/selected. Of particular importance is the \(\lambda\) parameter, which defines the level of sparsity in the model. First, we select it manually; afterwards, we use cross-validation to tune it. The other parameters we leave at their default values, although they can easily be changed.

model = fit_sgs(X = data$X, y = data$y, groups = groups, type="linear", lambda = 1, alpha=0.95, vFDR=0.1, gFDR=0.1, standardise = "l2", intercept = TRUE, verbose=FALSE)

Note: we have fit an intercept and applied \(\ell_2\) standardisation. This is the recommended usage when applying SGS.

Output of model

The package provides several useful outputs after fitting a model. The beta vector contains the fitted coefficients (note that the first entry is the intercept). We can also recover the indices of the non-zero variables and groups, which are indexed from the first variable, not the intercept.

model$beta[model$selected_var+1,]
##       v100       v136       v234       v334 
##  1.2211478  3.0393493  2.4211141 -0.4108451
model$group.effects[model$selected_group,]
##       G30       G39       G59       G76 
## 1.2211478 3.0393493 2.4211141 0.4108451
model$selected_var
## [1] 100 136 234 334
model$selected_group
## [1] 30 39 59 76
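
Since selection is bi-level, each selected variable should belong to a selected group. As a quick check (a base-R sketch using only the groups vector defined earlier, no package functions), we can map the selected variable indices back to their group labels:

groups[model$selected_var]

This returns 30, 39, 59 and 76 here, matching model$selected_group.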

Defining a function that lets us calculate various metrics (including the FDR and sensitivity):

fdr_sensitivity = function(fitted_ids, true_ids, num_coef){
  # calculates the FDR, sensitivity, FPR, and F1 score for a set of selected indices
  num_true = length(intersect(fitted_ids, true_ids))
  num_false = length(fitted_ids) - num_true
  num_missed = length(true_ids) - num_true
  num_true_negatives = num_coef - length(true_ids)
  out = list()
  out$fdr = num_false / (num_true + num_false)
  if (is.nan(out$fdr)){out$fdr = 0}
  out$sensitivity = num_true / length(true_ids)
  if (length(true_ids) == 0){
    out$sensitivity = 1
  }
  out$fpr = num_false / num_true_negatives
  out$f1 = (2*num_true)/(2*num_true + num_false + num_missed)
  if (is.nan(out$f1)){out$f1 = 1}
  return(out)
}

Calculating the relevant metrics gives

fdr_sensitivity(fitted_ids = model$selected_var, true_ids = data$true_var_id, num_coef = 500)
## $fdr
## [1] 0
## 
## $sensitivity
## [1] 0.1428571
## 
## $fpr
## [1] 0
## 
## $f1
## [1] 0.25
fdr_sensitivity(fitted_ids = model$selected_group, true_ids = data$true_grp_id, num_coef = 100)
## $fdr
## [1] 0
## 
## $sensitivity
## [1] 0.4
## 
## $fpr
## [1] 0
## 
## $f1
## [1] 0.5714286

The model is currently too sparse, as our choice of \(\lambda\) is too high. We can instead use cross-validation.

Cross validation

Cross-validation is used to fit SGS models along a \(\lambda\) path of length \(20\). The first value, \(\lambda_\text{max}\), is chosen to give the null model and the path is terminated at \(\lambda_\text{min} = \text{min\_frac} \cdot \lambda_\text{max}\). The 1se rule (as in the glmnet package) is used to choose the optimal model.
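
For intuition, a path of this form can be sketched in a few lines of base R. This is only an illustration of the general construction and not the package's internal code; lambda_max below is simply the first \(\lambda\) value reported by fit_sgs_cv in the output that follows:

# illustrative log-spaced lambda path (assumed construction, not package internals)
lambda_max = 1.8388
min_frac = 0.05
nlambda = 20
lambda_path = exp(seq(log(lambda_max), log(min_frac * lambda_max), length.out = nlambda))

The sequence decreases geometrically from lambda_max to min_frac * lambda_max and agrees with the \(\lambda\) values printed below.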

cv_model = fit_sgs_cv(X = data$X, y = data$y, groups=groups, type = "linear", nlambda = 20, nfolds=10, alpha = 0.95, vFDR = 0.1, gFDR = 0.1, min_frac = 0.05, standardise="l2",intercept=TRUE,verbose=TRUE)
## [1] "Lambda 1/20 done. Lambda: 1.8388. Number of non-zero: 0. Error: 9522.47623165794. Avg iter: 7"
## [1] "Lambda 2/20 done. Lambda: 1.5706. Number of non-zero: 1. Error: 9121.87427456379. Avg iter: 26"
## [1] "Lambda 3/20 done. Lambda: 1.3415. Number of non-zero: 2. Error: 8348.10001181521. Avg iter: 31"
## [1] "Lambda 4/20 done. Lambda: 1.1458. Number of non-zero: 3. Error: 7525.33322733818. Avg iter: 35"
## [1] "Lambda 5/20 done. Lambda: 0.9787. Number of non-zero: 8. Error: 6706.42167334536. Avg iter: 35"
## [1] "Lambda 6/20 done. Lambda: 0.8359. Number of non-zero: 14. Error: 5509.29608346792. Avg iter: 44"
## [1] "Lambda 7/20 done. Lambda: 0.714. Number of non-zero: 15. Error: 4256.09986924448. Avg iter: 37"
## [1] "Lambda 8/20 done. Lambda: 0.6098. Number of non-zero: 15. Error: 3281.14251906525. Avg iter: 37"
## [1] "Lambda 9/20 done. Lambda: 0.5209. Number of non-zero: 15. Error: 2559.14833615298. Avg iter: 37"
## [1] "Lambda 10/20 done. Lambda: 0.4449. Number of non-zero: 19. Error: 2011.70593862728. Avg iter: 36"
## [1] "Lambda 11/20 done. Lambda: 0.38. Number of non-zero: 22. Error: 1552.91875758507. Avg iter: 39"
## [1] "Lambda 12/20 done. Lambda: 0.3246. Number of non-zero: 22. Error: 1179.20560203183. Avg iter: 33"
## [1] "Lambda 13/20 done. Lambda: 0.2772. Number of non-zero: 22. Error: 892.913519340801. Avg iter: 32"
## [1] "Lambda 14/20 done. Lambda: 0.2368. Number of non-zero: 22. Error: 682.816068341828. Avg iter: 34"
## [1] "Lambda 15/20 done. Lambda: 0.2022. Number of non-zero: 24. Error: 525.918317926705. Avg iter: 36"
## [1] "Lambda 16/20 done. Lambda: 0.1727. Number of non-zero: 24. Error: 405.623243053464. Avg iter: 37"
## [1] "Lambda 17/20 done. Lambda: 0.1475. Number of non-zero: 26. Error: 314.722794039507. Avg iter: 39"
## [1] "Lambda 18/20 done. Lambda: 0.126. Number of non-zero: 26. Error: 244.495442693249. Avg iter: 39"
## [1] "Lambda 19/20 done. Lambda: 0.1076. Number of non-zero: 27. Error: 192.736115731821. Avg iter: 40"
## [1] "Lambda 20/20 done. Lambda: 0.0919. Number of non-zero: 27. Error: 154.39418403602. Avg iter: 41"

The verbose output contains useful information, showing the error for each \(\lambda\) value, as well as the number of non-zero parameters. Aside from the verbose output, we can see a more succinct summary by using the print function

print(cv_model)
## 
##  regression type:  linear 
## 
##          lambda     error estimated_non_zero
##  [1,] 1.8387781 9522.4762                  0
##  [2,] 1.5705583 9121.8743                  1
##  [3,] 1.3414633 8348.1000                  2
##  [4,] 1.1457860 7525.3332                  3
##  [5,] 0.9786519 6706.4217                  8
##  [6,] 0.8358975 5509.2961                 14
##  [7,] 0.7139663 4256.0999                 15
##  [8,] 0.6098211 3281.1425                 15
##  [9,] 0.5208674 2559.1483                 15
## [10,] 0.4448893 2011.7059                 19
## [11,] 0.3799940 1552.9188                 22
## [12,] 0.3245648 1179.2056                 22
## [13,] 0.2772210  892.9135                 22
## [14,] 0.2367832  682.8161                 22
## [15,] 0.2022440  525.9183                 24
## [16,] 0.1727430  405.6232                 24
## [17,] 0.1475452  314.7228                 26
## [18,] 0.1260230  244.4954                 26
## [19,] 0.1076402  192.7361                 27
## [20,] 0.0919389  154.3942                 27

The best model is found to be the one at the end of the path. Checking the metrics again, we see that CV has produced a model with the right level of sparsity, giving FDR values below the specified targets.

fdr_sensitivity(fitted_ids = cv_model$fit$selected_var, true_ids = data$true_var_id, num_coef = 500)
## $fdr
## [1] 0.03703704
## 
## $sensitivity
## [1] 0.9285714
## 
## $fpr
## [1] 0.002118644
## 
## $f1
## [1] 0.9454545
fdr_sensitivity(fitted_ids = cv_model$fit$selected_group, true_ids = data$true_grp_id, num_coef = 100)
## $fdr
## [1] 0.09090909
## 
## $sensitivity
## [1] 1
## 
## $fpr
## [1] 0.01111111
## 
## $f1
## [1] 0.952381

Plot

We can visualise the solution using the plot function

plot(cv_model,how_many = 10)

Prediction

The package implements a predict function, allowing for easy prediction

predict(model,data$X,type="linear")[1:5]
## [1]  2.79322803  0.03135326  5.00188815  4.00179438 -2.74987113
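
As a quick sanity check of the fit, we can also compare the in-sample predictions against the observed response (a minimal base-R sketch using the same predict call as above):

# in-sample mean squared error of the fitted linear SGS model
mean((predict(model, data$X, type = "linear") - data$y)^2)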

Logistic regression

As mentioned, the package can also be used to fit SGS to a Binomial response. First, we generate some Binomial data. We can use the same input matrix, \(X\), and true \(\beta\) as before.

sigmoid = function(x) {
   1 / (1 + exp(-x))
}
y = ifelse(sigmoid(data$X %*% data$true_beta + rnorm(400))>0.5,1,0)
train_y = y[1:350] 
test_y = y[351:400]
train_X = data$X[1:350,] 
test_X = data$X[351:400,]
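
Before fitting, it is worth a quick check that the generated classes are reasonably balanced across the train/test split (plain base R, nothing package-specific):

table(train_y)
table(test_y)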

Fitting and prediction

We can again apply CV.

cv_model = fit_sgs_cv(X = train_X, y = train_y, groups=groups, type = "logistic", nlambda = 20, nfolds=10, alpha = 0.95, vFDR = 0.1, gFDR = 0.1, min_frac = 0.05, standardise="l2",intercept=FALSE,verbose=TRUE)
## [1] "Lambda 1/20 done. Lambda: 0.0486. Number of non-zero: 0. Misclassification error: 0.437142857142857. Avg iter: 2"
## [1] "Lambda 2/20 done. Lambda: 0.0415. Number of non-zero: 1. Misclassification error: 0.351428571428571. Avg iter: 4"
## [1] "Lambda 3/20 done. Lambda: 0.0355. Number of non-zero: 2. Misclassification error: 0.331428571428571. Avg iter: 5"
## [1] "Lambda 4/20 done. Lambda: 0.0303. Number of non-zero: 2. Misclassification error: 0.314285714285714. Avg iter: 7"
## [1] "Lambda 5/20 done. Lambda: 0.0259. Number of non-zero: 9. Misclassification error: 0.28. Avg iter: 11"
## [1] "Lambda 6/20 done. Lambda: 0.0221. Number of non-zero: 11. Misclassification error: 0.222857142857143. Avg iter: 9"
## [1] "Lambda 7/20 done. Lambda: 0.0189. Number of non-zero: 13. Misclassification error: 0.208571428571429. Avg iter: 11"
## [1] "Lambda 8/20 done. Lambda: 0.0161. Number of non-zero: 23. Misclassification error: 0.177142857142857. Avg iter: 15"
## [1] "Lambda 9/20 done. Lambda: 0.0138. Number of non-zero: 34. Misclassification error: 0.14. Avg iter: 19"
## [1] "Lambda 10/20 done. Lambda: 0.0118. Number of non-zero: 43. Misclassification error: 0.134285714285714. Avg iter: 20"
## [1] "Lambda 11/20 done. Lambda: 0.0101. Number of non-zero: 57. Misclassification error: 0.131428571428571. Avg iter: 21"
## [1] "Lambda 12/20 done. Lambda: 0.0086. Number of non-zero: 80. Misclassification error: 0.142857142857143. Avg iter: 21"
## [1] "Lambda 13/20 done. Lambda: 0.0073. Number of non-zero: 102. Misclassification error: 0.14. Avg iter: 22"
## [1] "Lambda 14/20 done. Lambda: 0.0063. Number of non-zero: 113. Misclassification error: 0.148571428571429. Avg iter: 26"
## [1] "Lambda 15/20 done. Lambda: 0.0054. Number of non-zero: 120. Misclassification error: 0.145714285714286. Avg iter: 30"
## [1] "Lambda 16/20 done. Lambda: 0.0046. Number of non-zero: 124. Misclassification error: 0.145714285714286. Avg iter: 36"
## [1] "Lambda 17/20 done. Lambda: 0.0039. Number of non-zero: 132. Misclassification error: 0.145714285714286. Avg iter: 45"
## [1] "Lambda 18/20 done. Lambda: 0.0033. Number of non-zero: 136. Misclassification error: 0.137142857142857. Avg iter: 55"
## [1] "Lambda 19/20 done. Lambda: 0.0028. Number of non-zero: 154. Misclassification error: 0.148571428571429. Avg iter: 67"
## [1] "Lambda 20/20 done. Lambda: 0.0024. Number of non-zero: 164. Misclassification error: 0.148571428571429. Avg iter: 81"

and again, use the predict function

predictions = predict(cv_model$fit,test_X,type="logistic")

In the Binomial case, the function returns both the predicted class probabilities (response) and the predicted class (class). We can use this to check the prediction accuracy, given as \(82\%\).

predictions$response[1:5]
## [1] 0.4085523 0.6480921 0.1706632 0.1489437 0.3894489
predictions$class[1:5]
## [1] 0 1 0 0 0
sum(predictions$class == test_y)/length(test_y)
## [1] 0.82
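
Beyond the overall accuracy, a confusion matrix gives a fuller picture of where the misclassifications occur (a simple base-R tabulation; the exact counts depend on the noise added when generating y):

table(predicted = predictions$class, actual = test_y)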

Reference

Feser, F. and Evangelou, M. (2023). Sparse-group SLOPE: adaptive bi-level selection with FDR-control. arXiv preprint.