---
title: "Best subset selection with combss"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Best subset selection with combss}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 6,
  fig.height = 4
)
set.seed(1)
```

`combss` solves the best subset selection problem in generalised linear models by reformulating it as a continuous optimisation over the hypercube $[0, 1]^p$ and applying a Frank-Wolfe homotopy algorithm. The inner ridge problem is solved with `glmnet`. Three families are supported:

- `family = "gaussian"` (alias `"linear"`) — linear regression
- `family = "binomial"` — binary logistic regression
- `family = "multinomial"` — multinomial logistic regression

```{r}
library(combss)
```

## Linear regression

A simulated dataset with $n = 300$ observations and $p = 30$ predictors, of which only the first five are truly active.

```{r}
set.seed(102)
n <- 300; p <- 30
beta <- c(3, 2, 1.5, 1, 0.5, rep(0, p - 5))
x <- matrix(rnorm(n * p), n, p)
y <- as.numeric(x %*% beta + rnorm(n) * 0.5)
itr <- 1:200; iva <- 201:300
fit <- combss(x[itr, ], y[itr], x_val = x[iva, ], y_val = y[iva],
              family = "gaussian", q = 10)
fit
```

`fit$subset_list` contains the selected feature set for each `k = 1, ..., q`:

```{r}
fit$subset_list[1:8]
```

`fit$mse_path` gives the validation MSE at each `k`, and `fit$best_k` the size with the smallest MSE.
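To make the selection rule concrete, here is a base-R sketch of choosing a subset size by validation MSE — the same criterion `combss()` reports. It uses no `combss` functions: the nested subsets `1:k` below are a hypothetical stand-in for the sets `combss` actually places in `fit$subset_list`, and the chunk regenerates the simulated data (same seed as above) so it is self-contained.

```{r}
# Base-R illustration of size selection by validation MSE.
# NOTE: the nested subsets 1:k are a stand-in for fit$subset_list[[k]];
# combss itself searches over subsets via continuous optimisation.
set.seed(102)
n <- 300; p <- 30
beta <- c(3, 2, 1.5, 1, 0.5, rep(0, p - 5))
x <- matrix(rnorm(n * p), n, p)
y <- as.numeric(x %*% beta + rnorm(n) * 0.5)
itr <- 1:200; iva <- 201:300

mse <- sapply(1:10, function(k) {
  sub <- 1:k                                  # stand-in for fit$subset_list[[k]]
  refit <- lm(y[itr] ~ x[itr, sub, drop = FALSE])   # refit OLS on the subset
  pred <- cbind(1, x[iva, sub, drop = FALSE]) %*% coef(refit)
  mean((y[iva] - pred)^2)                     # validation MSE at size k
})
which.min(mse)                                # analogue of fit$best_k
```

The MSE drops as each truly active predictor enters and flattens (or rises) once only noise variables remain, which is the pattern the plot below shows for `fit$mse_path`.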
```{r}
plot(seq_along(fit$mse_path), fit$mse_path, type = "b",
     xlab = "k", ylab = "Validation MSE")
abline(v = fit$best_k, lty = 2)
```

## Binary logistic regression

```{r}
ybin <- as.numeric(plogis(x %*% beta) > 0.5)
fit_bin <- combss(x[itr, ], ybin[itr], x_val = x[iva, ], y_val = ybin[iva],
                  family = "binomial", q = 10)
fit_bin
```

## Multinomial logistic regression

```{r}
ymulti <- cut(as.numeric(x %*% beta), breaks = c(-Inf, -1, 1, Inf),
              labels = c("a", "b", "c"))
fit_mn <- combss(x[itr, ], ymulti[itr], x_val = x[iva, ], y_val = ymulti[iva],
                 family = "multinomial", q = 10)
fit_mn
```

## LOOCV ridge selection

For each candidate ridge penalty `lam_ridge`, `combss_cv()` runs COMBSS and evaluates the LOOCV error on the chosen subset.

```{r, eval = FALSE}
cv <- combss_cv(x, y, family = "gaussian", q = 6)
cv$best_lambda
```

## Predicting on new data

`predict()` refits on the chosen subset of the original training data and predicts on `newx`.

```{r}
preds <- predict(fit, newx = x[iva, ], x_train = x[itr, ], y_train = y[itr])
head(preds)
```

## References

- Moka, S., Liquet, B., Zhu, H. and Muller, S. (2024). COMBSS: best subset selection via continuous optimization. *Statistics and Computing*. \doi{10.1007/s11222-024-10387-8}
- Mathur, A., Liquet, B., Muller, S. and Moka, S. (2026). Parsimonious Subset Selection for Generalized Linear Models with Biomedical Applications. arXiv:2603.21952.