combss solves the best subset selection problem in
generalised linear models by reformulating it as a continuous
optimisation over the hypercube \([0,
1]^p\) and applying a Frank-Wolfe homotopy algorithm. The inner
ridge problem is solved with glmnet. Three families are
supported:
family = "gaussian" (alias "linear") —
linear regressionfamily = "binomial" — binary logistic regressionfamily = "multinomial" — multinomial logistic
regressionA simulated dataset with \(n = 300\) observations and \(p = 30\) predictors, of which only the first five are truly active.
set.seed(102)
n <- 300; p <- 30
beta <- c(3, 2, 1.5, 1, 0.5, rep(0, p - 5))
x <- matrix(rnorm(n * p), n, p)
y <- as.numeric(x %*% beta + rnorm(n) * 0.5)
itr <- 1:200; iva <- 201:300
fit <- combss(x[itr, ], y[itr],
x_val = x[iva, ], y_val = y[iva],
family = "gaussian", q = 10)
fit
#> COMBSS fit
#> family: gaussian
#> n, p: 200, 30
#> q: 10
#> lam_ridge: 0
#> best k: 5
#> mse: 0.22222
#> subset: 1, 2, 3, 4, 5fit$subset_list contains the selected feature set for
each k = 1, ..., q:
fit$subset_list[1:8]
#> [[1]]
#> [1] 1
#>
#> [[2]]
#> [1] 1 2
#>
#> [[3]]
#> [1] 1 2 3
#>
#> [[4]]
#> [1] 1 2 3 4
#>
#> [[5]]
#> [1] 1 2 3 4 5
#>
#> [[6]]
#> [1] 1 2 3 4 5 6
#>
#> [[7]]
#> [1] 1 2 3 4 5 6 21
#>
#> [[8]]
#> [1] 1 2 3 4 5 6 10 21fit$mse_path gives the validation MSE at each
k, and fit$best_k the size with smallest
MSE.
plot(seq_along(fit$mse_path), fit$mse_path, type = "b",
xlab = "k", ylab = "Validation MSE")
abline(v = fit$best_k, lty = 2)ymulti <- cut(as.numeric(x %*% beta),
breaks = c(-Inf, -1, 1, Inf),
labels = c("a", "b", "c"))
fit_mn <- combss(x[itr, ], ymulti[itr],
x_val = x[iva, ], y_val = ymulti[iva],
family = "multinomial", q = 10)
fit_mn
#> COMBSS fit
#> family: multinomial
#> n, p: 200, 30
#> q: 10
#> lam_ridge: 0
#> best k: 6
#> accuracy: 0.87
#> subset: 1, 2, 3, 4, 5, 12For each candidate ridge penalty lam_ridge,
combss_cv() runs COMBSS and evaluates LOOCV error on the
chosen subset.
predict() refits on the chosen subset of the original
training data and predicts on newx.