---
title: "Fitting a sparse GLM with glm4"
author: "Angus Hughes"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Fitting a sparse GLM with glm4}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

# For reproducibility
set.seed(6006)

# Set the number of levels
n_levels <- 3000

# Randomly draw the number of observations per level
n_obs_lambda <- rgamma(n_levels, 3.1) * 100
n_obs_per_level <- rpois(n_levels, n_obs_lambda) + 1

# Simulate the mean probability per level
prob_per_level <- rbeta(n_levels, 3.1, 4.4)

# For each level, simulate data
d <- vector(mode = "list", length = n_levels)
for (l in seq_along(n_obs_per_level)) {
  d[[l]] <- data.frame(
    y = rbinom(n_obs_per_level[l], 1, prob_per_level[l]),
    level = l)
}

# Combine into a data.frame
df <- do.call("rbind", d)
df$level <- as.factor(df$level)
```

# Getting started

To demonstrate the `glm4` package, let's generate a large hypothetical dataset. This data has `r format(n_levels, big.mark = ",")` levels of a factor and a binary outcome $y$. Each level has somewhere between 1 and over 1,000 observations of the outcome, along with its own average success probability.

```{r, echo=FALSE, fig.width=7, fig.height=4, fig.align = 'center'}
n_obs_total <- sum(n_obs_per_level)
hist(n_obs_per_level,
     breaks = 80,
     main = paste(format(n_levels, big.mark = ","), "levels, total N =",
                  format(n_obs_total, big.mark = ",")),
     xlab = "Number of observations by level")
```

We are interested in fitting a logistic regression that includes all `r format(n_levels, big.mark = ",")` factor levels as fixed effects. The model matrix has `r format(n_obs_total, big.mark = ",")` rows and `r format(n_levels + 1L, big.mark = ",")` columns: one per factor level plus an intercept. This matrix is extremely sparse: each row contains exactly one `1` (for its factor level) and `r format(n_levels, big.mark = ",")` zeros.
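To make this structure concrete, here is a toy sketch (a hypothetical three-level factor, separate from the simulated data above) showing the one-hot pattern of such a model matrix, built with `Matrix::sparse.model.matrix()`:

```{r}
# Toy illustration only: a three-level factor yields a model matrix with an
# intercept column plus one dummy column per non-reference level, which
# Matrix::sparse.model.matrix() stores in sparse form
f <- factor(c("a", "a", "b", "c"))
Matrix::sparse.model.matrix(~ f)
```

When printed, the `Matrix` package displays structural zeros as `.`, which makes the sparsity pattern easy to see at a glance.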
In practice, on many machines calling `stats::glm()` here will either throw a `cannot allocate vector of size ...` error due to RAM constraints or crash the R session entirely. `glm4()` with `sparse = TRUE` stores and operates on only the non-zero elements of the model matrix, making model fitting possible.

```{r}
library(glm4)

run_time <- system.time(
  fit <- glm4(y ~ level, data = df, family = binomial(), sparse = TRUE)
)
run_time
```

This took `r unname(round(run_time['elapsed'], 1))` seconds to fit. We can briefly inspect the output:

```{r}
# Only print a subset of the output, restoring the existing max.print after
old_options <- options(max.print = 20)
print(fit)
options(old_options)
```

For a saturated factor model, the model-implied probability for each level equals the observed proportion exactly. We can verify this: the inverse-logit of the level-specific linear predictor should match the raw sample mean.

```{r}
# Model-implied probability for level 2
plogis(sum(coef(fit)[1:2]))

# Should equal the observed proportion
with(df, mean(y[level == 2]))
```

# How it works

`glm4()` is a wrapper for `MatrixModels::glm4()`, which fits GLMs using iteratively reweighted least squares (IRLS) with support for both dense and sparse matrix arithmetic via the `Matrix` package (distributed with every R installation).

The raw output of `MatrixModels::glm4()` is an S4 object of class `"glpModel"` with four slots:

- `@resp` (`glmRespMod`): the response module. Key sub-slots are `@y` (response vector), `@mu` (fitted means), `@eta` (linear predictor), `@offset`, `@weights` (prior weights), and `@sqrtXwt` (the square root of the IRLS working weights, stored as a matrix to accommodate the sparse case).
- `@pred` (`dPredModule` when `sparse = FALSE`, `sPredModule` when `sparse = TRUE`): the prediction module. Contains `@X` (the model matrix as a dense or sparse `Matrix` object), `@coef` (the coefficient vector), and `@fac` (the Cholesky factorisation of $X^\top W X$ used at each IRLS step).
- `@call`: the original function call.
- `@fitProps`: a list of convergence diagnostics: `convcrit` (the final convergence criterion), `iter` (the number of IRLS iterations), and `nHalvings` (the number of step-halvings performed).

The `glm4` package unpacks this S4 object and assembles a familiar S3 list that closely mirrors the structure of a `stats::glm` object. The original `"glpModel"` S4 fit is preserved in `fit$glm4_fit` and remains accessible for users who need lower-level access.

```{r}
class(fit$glm4_fit)
```

# Inspecting the output

`MatrixModels` provides no print, summary, or inference methods. The `glm4` package fills this gap. The summary method provides coefficients, standard errors, and test statistics that mirror a standard R GLM as closely as possible:

```{r}
summary(fit)
```

We can inspect the log-likelihood of the model like so:

```{r}
logLik(fit)
```

We can also generate confidence intervals for each parameter estimate:

```{r}
confint(fit)[1:20, ]
```

The variance-covariance matrix is returned as a sparse `Matrix` object, which is memory-efficient for large models:

```{r}
vcov(fit)[1:5, 1:5]
```

# Analysis of deviance with `anova`

## Sequential deviance: single model

Out of the box, `MatrixModels` provides no model comparison functionality. This is implemented in `glm4`. For a single model with multiple predictors, `anova()` adds terms one at a time and reports the reduction in residual deviance at each step (consistent with `anova.glm`).

To illustrate, we add a simulated continuous covariate `x` to the data and refit:

```{r}
set.seed(6006)

# Add covariate and fit
df$x <- rnorm(nrow(df))
fit_x <- glm4(y ~ x + level, data = df, family = binomial(), sparse = TRUE)

# glm4 anova method
anova(fit_x)
```

The test statistic is chosen automatically from the family, matching `stats::anova.glm` behaviour: `"F"` for families with an estimated dispersion (Gaussian, Gamma, …) and `"Chisq"` for families with a fixed dispersion (binomial, Poisson).
For binomial models this means a likelihood-ratio $\chi^2$ test. We can also request it explicitly:

```{r}
anova(fit_x, test = "Chisq")
```

The small deviance reduction for `x` (and its non-significant $p$-value) is consistent with `x` being pure noise.

## Model comparison: multiple models

Passing two or more `glm4` objects to `anova()` produces a model comparison table, equivalent to a likelihood-ratio test between nested models:

```{r}
anova(fit, fit_x, test = "Chisq")
```

The output format and behaviour match `stats::anova.glm`, so results can be interpreted and handled identically to those from a standard `glm` workflow.

# Code to generate the example data

```{r, eval = FALSE}
set.seed(6006)

n_levels <- 3000
n_obs_lambda <- rgamma(n_levels, 3.1) * 100
n_obs_per_level <- rpois(n_levels, n_obs_lambda) + 1
prob_per_level <- rbeta(n_levels, 3.1, 4.4)

d <- vector(mode = "list", length = n_levels)
for (l in seq_along(n_obs_per_level)) {
  d[[l]] <- data.frame(
    y = rbinom(n_obs_per_level[l], 1, prob_per_level[l]),
    level = l)
}

df <- do.call("rbind", d)
df$level <- as.factor(df$level)
```
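
# Lower-level access to the S4 fit

Finally, for readers who want the convergence diagnostics described under *How it works*, they remain one slot access away on the preserved `"glpModel"` object. A brief sketch, assuming the `@fitProps` slot names documented above:

```{r, eval = FALSE}
# The original S4 fit is kept in fit$glm4_fit; its @fitProps slot is a
# list of IRLS convergence diagnostics
fit$glm4_fit@fitProps$iter       # number of IRLS iterations
fit$glm4_fit@fitProps$convcrit   # final convergence criterion
fit$glm4_fit@fitProps$nHalvings  # step-halvings performed
```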