Best subset selection with combss

combss solves the best subset selection problem in generalised linear models by reformulating it as a continuous optimisation over the hypercube \([0, 1]^p\) and applying a Frank-Wolfe homotopy algorithm. The inner ridge problem is solved with glmnet. Three families are supported: gaussian, binomial, and multinomial.
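
Schematically, for the gaussian family the discrete problem is replaced by a continuous surrogate of the form (a sketch of the COMBSS idea; the package's exact objective may differ in its details)

\[
\min_{t \in [0, 1]^p} \; \frac{1}{n} \lVert y - X \operatorname{diag}(t)\, \beta_t \rVert^2 + \lambda \sum_{j=1}^{p} t_j,
\]

where \(\beta_t\) is the inner ridge solution for the rescaled design \(X \operatorname{diag}(t)\). At a minimiser the entries of \(t\) are pushed towards \(\{0, 1\}\), and thresholding \(t\) gives the selected subset; increasing \(\lambda\) shrinks the subset, which is how the path over sizes \(k = 1, \dots, q\) is traced.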

library(combss)

Linear regression

We simulate \(n = 300\) observations on \(p = 30\) predictors, of which only the first five are truly active, and split them into 200 training and 100 validation rows.

set.seed(102)
n <- 300; p <- 30
beta <- c(3, 2, 1.5, 1, 0.5, rep(0, p - 5))
x <- matrix(rnorm(n * p), n, p)
y <- as.numeric(x %*% beta + rnorm(n) * 0.5)

itr <- 1:200; iva <- 201:300
fit <- combss(x[itr, ], y[itr],
              x_val = x[iva, ], y_val = y[iva],
              family = "gaussian", q = 10)
fit
#> COMBSS fit
#>   family:    gaussian
#>   n, p:      200, 30
#>   q:         10
#>   lam_ridge: 0
#>   best k:    5
#>   mse:       0.22222
#>   subset:    1, 2, 3, 4, 5

fit$subset_list contains the selected feature set for each k = 1, ..., q:

fit$subset_list[1:8]
#> [[1]]
#> [1] 1
#> 
#> [[2]]
#> [1] 1 2
#> 
#> [[3]]
#> [1] 1 2 3
#> 
#> [[4]]
#> [1] 1 2 3 4
#> 
#> [[5]]
#> [1] 1 2 3 4 5
#> 
#> [[6]]
#> [1] 1 2 3 4 5 6
#> 
#> [[7]]
#> [1]  1  2  3  4  5  6 21
#> 
#> [[8]]
#> [1]  1  2  3  4  5  6 10 21

fit$mse_path gives the validation MSE at each subset size k, and fit$best_k the size that attains the smallest MSE.

plot(seq_along(fit$mse_path), fit$mse_path, type = "b",
     xlab = "k", ylab = "Validation MSE")
abline(v = fit$best_k, lty = 2)
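
The selected variables can also be used outside the package; for instance, an ordinary least squares refit on the best subset (a manual illustration, not part of the combss API) recovers estimates close to the true coefficients:

sel <- fit$subset_list[[fit$best_k]]  # columns selected at the best size
refit <- lm(y[itr] ~ x[itr, sel])     # plain OLS on those columns
coef(refit)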

Binary logistic regression

ybin <- as.numeric(plogis(x %*% beta) > 0.5)
fit_bin <- combss(x[itr, ], ybin[itr],
                  x_val = x[iva, ], y_val = ybin[iva],
                  family = "binomial", q = 10)
fit_bin
#> COMBSS fit
#>   family:    binomial
#>   n, p:      200, 30
#>   q:         10
#>   lam_ridge: 0
#>   best k:    5
#>   accuracy:  1
#>   subset:    1, 2, 3, 4, 5
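
The perfect accuracy is no surprise: plogis(z) > 0.5 exactly when z > 0, so the labels are a deterministic function of the linear predictor and the two classes are linearly separable by construction. A quick sanity check:

all(ybin == as.numeric(x %*% beta > 0))
#> [1] TRUE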

Multinomial logistic regression

ymulti <- cut(as.numeric(x %*% beta),
              breaks = c(-Inf, -1, 1, Inf),
              labels = c("a", "b", "c"))
fit_mn <- combss(x[itr, ], ymulti[itr],
                 x_val = x[iva, ], y_val = ymulti[iva],
                 family = "multinomial", q = 10)
fit_mn
#> COMBSS fit
#>   family:    multinomial
#>   n, p:      200, 30
#>   q:         10
#>   lam_ridge: 0
#>   best k:    6
#>   accuracy:  0.87
#>   subset:    1, 2, 3, 4, 5, 12

LOOCV ridge selection

For each candidate ridge penalty lam_ridge, combss_cv() runs COMBSS and evaluates LOOCV error on the chosen subset.

cv <- combss_cv(x, y, family = "gaussian", q = 6)
cv$best_lambda
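
The selected penalty can then be supplied to a final combss() fit. The argument name lam_ridge below is an assumption, inferred from the field of the same name in the printed fit:

fit_ridge <- combss(x[itr, ], y[itr],
                    x_val = x[iva, ], y_val = y[iva],
                    family = "gaussian", q = 10,
                    lam_ridge = cv$best_lambda)  # assumed argument name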

Predicting on new data

predict() refits on the chosen subset, using the training data supplied through x_train and y_train, and predicts on newx.

preds <- predict(fit, newx = x[iva, ],
                 x_train = x[itr, ], y_train = y[itr])
head(preds)
#> [1] -2.94159809 -6.56104048  1.71124589  4.53489027  0.11400216 -0.02461332
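
With the validation responses at hand, the held-out error of these predictions can be computed directly and should be close to fit$mse_path[fit$best_k]:

mean((preds - y[iva])^2)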
