---
title: "Best subset selection with combss"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Best subset selection with combss}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 6,
  fig.height = 4
)
set.seed(1)
```

`combss` solves the best subset selection problem in generalised linear models by reformulating it as a continuous optimisation over the hypercube $[0, 1]^p$ and applying a Frank-Wolfe homotopy algorithm. The inner ridge problem is solved with `glmnet`. Three families are supported:

- `family = "gaussian"` (alias `"linear"`) — linear regression
- `family = "binomial"` — binary logistic regression
- `family = "multinomial"` — multinomial logistic regression

```{r}
library(combss)
```

## Linear regression

A simulated dataset with $n = 300$ observations and $p = 30$ predictors, of which only the first five are truly active.

```{r}
set.seed(102)
n <- 300; p <- 30
beta <- c(3, 2, 1.5, 1, 0.5, rep(0, p - 5))
x <- matrix(rnorm(n * p), n, p)
y <- as.numeric(x %*% beta + rnorm(n) * 0.5)
itr <- 1:200; iva <- 201:300
fit <- combss(x[itr, ], y[itr], x_val = x[iva, ], y_val = y[iva],
              family = "gaussian", q = 10)
fit
```

`fit$subset_list` contains the selected feature set for each `k = 1, ..., q`:

```{r}
fit$subset_list[1:8]
```

`fit$mse_path` gives the validation MSE at each `k`, and `fit$best_k` the size with the smallest MSE.
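To make the selection rule concrete, here is a base-R sketch of choosing a subset size by validation MSE — the same criterion `combss()` reports. It uses no `combss` functions: the nested subsets `1:k` below are a hypothetical stand-in for the sets `combss` actually places in `fit$subset_list`, and the chunk regenerates the simulated data (same seed as above) so it is self-contained.

```{r}
# Base-R illustration of size selection by validation MSE.
# NOTE: the nested subsets 1:k are a stand-in for fit$subset_list[[k]];
# combss itself searches over subsets via continuous optimisation.
set.seed(102)
n <- 300; p <- 30
beta <- c(3, 2, 1.5, 1, 0.5, rep(0, p - 5))
x <- matrix(rnorm(n * p), n, p)
y <- as.numeric(x %*% beta + rnorm(n) * 0.5)
itr <- 1:200; iva <- 201:300

mse <- sapply(1:10, function(k) {
  sub <- 1:k                                  # stand-in for fit$subset_list[[k]]
  refit <- lm(y[itr] ~ x[itr, sub, drop = FALSE])   # refit OLS on the subset
  pred <- cbind(1, x[iva, sub, drop = FALSE]) %*% coef(refit)
  mean((y[iva] - pred)^2)                     # validation MSE at size k
})
which.min(mse)                                # analogue of fit$best_k
```

The MSE drops as each truly active predictor enters and flattens (or rises) once only noise variables remain, which is the pattern the plot below shows for `fit$mse_path`.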
```{r}
plot(seq_along(fit$mse_path), fit$mse_path, type = "b",
     xlab = "k", ylab = "Validation MSE")
abline(v = fit$best_k, lty = 2)
```

## Binary logistic regression

```{r}
ybin <- as.numeric(plogis(x %*% beta) > 0.5)
fit_bin <- combss(x[itr, ], ybin[itr], x_val = x[iva, ], y_val = ybin[iva],
                  family = "binomial", q = 10)
fit_bin
```

## Multinomial logistic regression

```{r}
ymulti <- cut(as.numeric(x %*% beta), breaks = c(-Inf, -1, 1, Inf),
              labels = c("a", "b", "c"))
fit_mn <- combss(x[itr, ], ymulti[itr], x_val = x[iva, ], y_val = ymulti[iva],
                 family = "multinomial", q = 10)
fit_mn
```

## LOOCV ridge selection

For each candidate ridge penalty `lam_ridge`, `combss_cv()` runs COMBSS and evaluates the LOOCV error on the chosen subset.

```{r, eval = FALSE}
cv <- combss_cv(x, y, family = "gaussian", q = 6)
cv$best_lambda
```

## Predicting on new data

`predict()` refits on the chosen subset of the original training data and predicts on `newx`.

```{r}
preds <- predict(fit, newx = x[iva, ], x_train = x[itr, ], y_train = y[itr])
head(preds)
```

## References

- Moka, S., Liquet, B., Zhu, H. and Muller, S. (2024). COMBSS: best subset selection via continuous optimization. *Statistics and Computing*. \doi{10.1007/s11222-024-10387-8}
- Mathur, A., Liquet, B., Muller, S. and Moka, S. (2026). Parsimonious Subset Selection for Generalized Linear Models with Biomedical Applications. arXiv:2603.21952.