---
title: "Fitting a sparse GLM with glm4"
author: "Angus Hughes"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Fitting a sparse GLM with glm4}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

# For reproducibility
set.seed(6006)

# Set the number of levels
n_levels <- 3000

# Randomly draw the number of observations per level
n_obs_lambda <- rgamma(n_levels, 3.1) * 100
n_obs_per_level <- rpois(n_levels, n_obs_lambda) + 1

# Simulate the mean probability per level
prob_per_level <- rbeta(n_levels, 3.1, 4.4)

# For each level, simulate data
d <- vector(mode = "list", length = n_levels)
for (l in seq_along(n_obs_per_level)) {
  d[[l]] <- data.frame(
    y = rbinom(n_obs_per_level[l], 1, prob_per_level[l]),
    level = l)
}

# Combine into a data.frame
df <- do.call("rbind", d)
df$level <- as.factor(df$level)
```

# Getting started

To demonstrate the `glm4` package, let's generate a large hypothetical dataset. This data has `r format(n_levels, big.mark = ",")` levels of a factor and a binary outcome $y$. Each level has somewhere between 1 and over 1,000 observations of the outcome, along with its own average success probability.

```{r, echo=FALSE, fig.width=7, fig.height=4, fig.align = 'center'}
n_obs_total <- sum(n_obs_per_level)
hist(n_obs_per_level,
     breaks = 80,
     main = paste(format(n_levels, big.mark = ","), "levels, total N =",
                  format(n_obs_total, big.mark = ",")),
     xlab = "Number of observations by level")
```

We are interested in fitting a logistic regression that includes all `r format(n_levels, big.mark = ",")` factor levels as fixed effects. The model matrix has `r format(n_obs_total, big.mark = ",")` rows and `r format(n_levels + 1L, big.mark = ",")` columns: one per factor level plus an intercept. This matrix is extremely sparse: each row contains exactly one `1` (for its factor level) and `r format(n_levels, big.mark = ",")` zeros.
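To make this structure concrete, here is a toy sketch (a hypothetical three-level factor, separate from the simulated data above) showing the one-hot pattern of such a model matrix, built with `Matrix::sparse.model.matrix()`:

```{r}
# Toy illustration only: a three-level factor yields a model matrix with an
# intercept column plus one dummy column per non-reference level, which
# Matrix::sparse.model.matrix() stores in sparse form
f <- factor(c("a", "a", "b", "c"))
Matrix::sparse.model.matrix(~ f)
```

When printed, the `Matrix` package displays structural zeros as `.`, which makes the sparsity pattern easy to see at a glance.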
In practice, on many machines calling `stats::glm()` here will either throw a `cannot allocate vector of size ...` error due to RAM constraints or crash the R session entirely. `glm4()` with `sparse = TRUE` stores and operates on only the non-zero elements of the model matrix, making model fitting possible.

```{r}
library(glm4)

run_time <- system.time(
  fit <- glm4(y ~ level, data = df, family = binomial(), sparse = TRUE)
)
run_time
```

This took `r unname(round(run_time['elapsed'], 1))` seconds to fit. We can briefly inspect the output:

```{r}
# Only print a subset of the output, restoring the existing max.print after
old_options <- options(max.print = 20)
print(fit)
options(old_options)
```

For a saturated factor model, the model-implied probability for each level equals the observed proportion exactly. We can verify this: the inverse-logit of the level-specific linear predictor should match the raw sample mean.

```{r}
# Model-implied probability for level 2
plogis(sum(coef(fit)[1:2]))

# Should equal the observed proportion
with(df, mean(y[level == 2]))
```

# How it works

`glm4()` is a wrapper for `MatrixModels::glm4()`, which fits GLMs using iteratively reweighted least squares (IRLS) with support for both dense and sparse matrix arithmetic via the `Matrix` package (distributed with every R installation).

The raw output of `MatrixModels::glm4()` is an S4 object of class `"glpModel"` with four slots:

- `@resp` (`glmRespMod`): the response module. Key sub-slots are `@y` (response vector), `@mu` (fitted means), `@eta` (linear predictor), `@offset`, `@weights` (prior weights), and `@sqrtXwt` (the square root of the IRLS working weights, stored as a matrix to accommodate the sparse case).
- `@pred` (`dPredModule` when `sparse = FALSE`, `sPredModule` when `sparse = TRUE`): the prediction module. Contains `@X` (the model matrix as a dense or sparse `Matrix` object), `@coef` (the coefficient vector), and `@fac` (the Cholesky factorisation of $X^\top W X$ used at each IRLS step).
- `@call`: the original function call.
- `@fitProps`: a list of convergence diagnostics: `convcrit` (the final convergence criterion), `iter` (the number of IRLS iterations), and `nHalvings` (the number of step-halvings performed).

The `glm4` package unpacks this S4 object and assembles a familiar S3 list that closely mirrors the structure of a `stats::glm` object. The original `"glpModel"` S4 fit is preserved in `fit$glm4_fit` and remains accessible for users who need lower-level access.

```{r}
class(fit$glm4_fit)
```

# Inspecting the output

`MatrixModels` provides no print, summary, or inference methods. The `glm4` package fills this gap. The summary method provides coefficients, standard errors, and test statistics that mirror a standard R GLM as closely as possible:

```{r}
summary(fit)
```

We can inspect the log-likelihood of the model like so:

```{r}
logLik(fit)
```

We can also generate confidence intervals for each parameter estimate:

```{r}
confint(fit)[1:20, ]
```

The variance-covariance matrix is returned as a sparse `Matrix` object, which is memory-efficient for large models:

```{r}
vcov(fit)[1:5, 1:5]
```

# Analysis of deviance with `anova`

## Sequential deviance: single model

Out of the box, `MatrixModels` provides no model comparison functionality. This is implemented in `glm4`. For a single model with multiple predictors, `anova()` adds terms one at a time and reports the reduction in residual deviance at each step (consistent with `anova.glm`).

To illustrate, we add a simulated continuous covariate `x` to the data and refit:

```{r}
set.seed(6006)

# Add covariate and fit
df$x <- rnorm(nrow(df))
fit_x <- glm4(y ~ x + level, data = df, family = binomial(), sparse = TRUE)

# glm4 anova method
anova(fit_x)
```

The test statistic is chosen automatically from the family, matching `stats::anova.glm` behaviour: `"F"` for families with an estimated dispersion (Gaussian, Gamma, …) and `"Chisq"` for families with a fixed dispersion (binomial, Poisson).
For binomial models this means a likelihood-ratio $\chi^2$ test. We can also request it explicitly:

```{r}
anova(fit_x, test = "Chisq")
```

The small deviance reduction for `x` (and its non-significant $p$-value) is consistent with `x` being pure noise.

## Model comparison: multiple models

Passing two or more `glm4` objects to `anova()` produces a model comparison table, equivalent to a likelihood-ratio test between nested models:

```{r}
anova(fit, fit_x, test = "Chisq")
```

The output format and behaviour match `stats::anova.glm`, so results can be interpreted and handled identically to those from a standard `glm` workflow.

# Code to generate the example data

```{r, eval = FALSE}
set.seed(6006)

n_levels <- 3000
n_obs_lambda <- rgamma(n_levels, 3.1) * 100
n_obs_per_level <- rpois(n_levels, n_obs_lambda) + 1
prob_per_level <- rbeta(n_levels, 3.1, 4.4)

d <- vector(mode = "list", length = n_levels)
for (l in seq_along(n_obs_per_level)) {
  d[[l]] <- data.frame(
    y = rbinom(n_obs_per_level[l], 1, prob_per_level[l]),
    level = l)
}

df <- do.call("rbind", d)
df$level <- as.factor(df$level)
```
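
# Lower-level access to the S4 fit

Finally, for readers who want the convergence diagnostics described under *How it works*, they remain one slot access away on the preserved `"glpModel"` object. A brief sketch, assuming the `@fitProps` slot names documented above:

```{r, eval = FALSE}
# The original S4 fit is kept in fit$glm4_fit; its @fitProps slot is a
# list of IRLS convergence diagnostics
fit$glm4_fit@fitProps$iter       # number of IRLS iterations
fit$glm4_fit@fitProps$convcrit   # final convergence criterion
fit$glm4_fit@fitProps$nHalvings  # step-halvings performed
```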