--- title: "A brief guide to GAReg" author: "Mo Li, QiQi Lu, Robert Lund, Xueheng Shi" date: "`r format(Sys.time(), '%d %b %Y')`" output: rmarkdown::html_vignette: number_sections: false toc: true vignette: > %\VignetteIndexEntry{A brief guide to GAReg} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", warning = FALSE, message = FALSE ) set.seed(12345) ``` ## Overview **GAreg** implements genetic algorithm (GA) optimization for three core statistical workflows where discrete model structure must be learned from data: 1. Best subset variable selection – choose a parsimonious set of predictors under information or validation criteria. 2. Changepoint detection – estimate the number and locations of structural breaks in regression or time series. 3. Optimal spline knot selection – place interior knots for splines or piecewise polynomials to balance fit and smoothness. The package provides a consistent GA interface, unified S4 results (gareg), and penalty-aware objectives. It supports both varying‑knot and fixed‑knot modes, minimum spacing constraints, and unbalanced designs. ```{r setup} library(GAReg) ``` ## Best subset variable selection We use the simulation function below for subset selection illustration. Here, `n` is the number of observations and `p` is the number of predictors. For the covairates, `s0` represent number of truly active predictors, valued range from `0` to `p`. `magnitudes_range` specifies the range of significantly expressed coefficients that corresponding to the truly active predictors. If `rho` is specified with some values, the autoregressive structure is introduced into the error terms. If `rho=NULL`, we will have independent and identically distributed (IID) errors. We can also specify `sigma` for the error standard deviation. ```{r} sim_subset_data <- function(n = 60, p = 50, s0 = 25, sigma = 1.5, magnitudes_range = c(0.5, 2), rho = NULL, seed = NULL) { stopifnot(n > 0, p > 0, s0 >= 0, s0 <= p, sigma >= 0) if (!is.null(seed)) set.seed(seed) X <- matrix(rnorm(n * p), n, p) # Active set and coefficients true_idx <- if (s0 > 0) sort(sample.int(p, s0)) else integer(0) signs <- if (s0 > 0) sample(c(-1, 1), s0, replace = TRUE) else numeric(0) magnitudes <- if (s0 > 0) runif(s0, magnitudes_range[1], magnitudes_range[2]) else numeric(0) beta_true <- numeric(p) if (s0 > 0) beta_true[true_idx] <- magnitudes * signs if (is.null(rho)) { e <- rnorm(n, sd = sigma) } else { sd_innov <- sigma * sqrt(1 - rho^2) burn_in <- 100 z <- rnorm(n + burn_in, sd = sd_innov) e_full <- numeric(n + burn_in) for (t in 2:(n + burn_in)) e_full[t] <- rho * e_full[t - 1] + z[t] e <- e_full[(burn_in + 1):(burn_in + n)] } y <- as.numeric(X %*% beta_true + e) DF <- data.frame(y = y, as.data.frame(X)) colnames(DF)[-1] <- paste0("X", seq_len(p)) list( X = X, y = y, beta_true = beta_true, true_idx = true_idx, DF = DF, rho = if (is.null(rho)) NULL else rho, args = list( n = n, p = p, s0 = s0, sigma = sigma, magnitudes_range = magnitudes_range, rho = rho, seed = seed ) ) } ``` ```{r} sim <- sim_subset_data(n = 100, p = 50, s0 = 25, sigma = 1.5, rho = NULL, seed = 123) y <- sim$y X <- sim$X ga <- gareg_subset( y = y, X = X, gaMethod = "GA", monitor = FALSE, gacontrol = list( popSize = 120, maxiter = 20000, run = 4000, pmutation = 0.02 ) ) summary(ga) res <- FDRCalc(truelabel = sim$true_idx, predlabel = ga@bestsol, N = 50) # FALSE Discover Rate res$fdr # TRUE Positive Rate res$tpr ``` ## Changepoint detection The multiple changepoint detection can be conducted through `changepointGA` package (Li and Lu, 2024). The BIC penalized function is provided below for IID data. The related math details can be found in (Li et al., 2026). ```{r} BIC.cpt <- function(chromosome, Xt) { m <- chromosome[1] tau <- chromosome[2:(2 + m - 1)] N <- length(Xt) if (m == 0) { mu.hat <- mean(Xt) sigma.hatsq <- sum((Xt - mu.hat)^2) / N BIC.obj <- 0.5 * N * log(sigma.hatsq) + 2 * log(N) } else { tau.ext <- c(1, tau, N + 1) seg.len <- diff(tau.ext) ff <- rep(0:m, times = seg.len) Xseg <- split(Xt, ff) mu.seg <- unlist(lapply(Xseg, mean), use.names = F) mu.hat <- rep(mu.seg, seg.len) sigma.hatsq <- sum((Xt - mu.hat)^2) / N BIC.obj <- 0.5 * N * log(sigma.hatsq) + (2 * m + 2) * log(N) } return(BIC.obj) } # IID data set.seed(1234) n <- 200 et <- rnorm(n) Xt <- et + rep(c(0, 2, 0, 2), each = n / 4) library(changepointGA) GA.res <- cptga( ObjFunc = BIC.cpt, N = n, popSize = 500, pcrossover = 0.95, pmutation = 0.3, pchangepoint = 10 / n, Xt = Xt ) summary(GA.res) ``` ## Optimal spline knot selection ### Example data set This section will illustrate optimal spline knot placement on the classic motocycle impact dataset from Package \code{MASS} (Venables & Ripley, 2002).\code{MASS::mcycle} contains 133 observations from a simulated motorcycle crash test, recording head acceleration (in g) of a helmeted test subject over time (milliseconds). The series is non-linear with sharp transients and heteroskedastic noise, which makes it a canonical benchmark for smoothing and spline-based regression. We use function \code{gareg_knots} choose interior spline knots subject to a minimum separation in indices to avoid clustering. Here the `truncated-power piecewise polynomials` with polynomial degree of 3 is the used default. ```{r} library(MASS) library(splines) data(mcycle) head(mcycle) ``` ### Different GA model set-up - Varying knots with single population GA; ```{r} g1 <- gareg_knots( y = mcycle$accel, x = mcycle$times, minDist = 5, gaMethod = "cptga", cptgactrl = cptgaControl(popSize = 200, pcrossover = 0.9, pmutation = 0.3), ic_method = "BIC" ) summary(g1) # knots location g1@bestsol ``` - Varying knots with island model GA; ```{r} g2 <- gareg_knots( y = mcycle$accel, x = mcycle$times, minDist = 5, gaMethod = "cptgaisl", cptgactrl = cptgaControl( numIslands = 5, popSize = 200, maxMig = 250, pcrossover = 0.9, pmutation = 0.3 ), ic_method = "BIC" ) summary(g2) ``` - Fixed knots with single population GA; ```{r} g3 <- gareg_knots( y = mcycle$accel, x = mcycle$times, fixedknots = 3, minDist = 5, gaMethod = "cptga", cptgactrl = cptgaControl(popSize = 200, pcrossover = 0.9, pmutation = 0.3), ic_method = "BIC" ) summary(g3) ``` - Fixed knots with island model GA; ```{r} g4 <- gareg_knots( y = mcycle$accel, x = mcycle$times, fixedknots = 4, minDist = 5, gaMethod = "cptgaisl", cptgactrl = cptgaControl( numIslands = 5, popSize = 200, maxMig = 250, pcrossover = 0.9, pmutation = 0.3 ), ic_method = "BIC" ) summary(g4) ``` ```{r, fig.width=7, fig.height=5, out.width="90%", fig.align='center'} y <- mcycle$accel x <- mcycle$times x_unique <- unique(x) tBIC.vary.ga <- g1@bestsol tBIC.vary.gaisl <- g2@bestsol tBIC.fix.3.ga <- g3@bestsol tBIC.fix.4.gaisl <- g4@bestsol bsfit.vary.ga <- lm(y ~ bs(x, knots = x_unique[g1@bestsol], Boundary.knots = range(x))) bsfit.vary.gaisl <- lm(y ~ bs(x, knots = x_unique[g2@bestsol], Boundary.knots = range(x))) bsfit.fix.3.ga <- lm(y ~ bs(x, knots = x_unique[g3@bestsol], Boundary.knots = range(x))) bsfit.fix.4.gaisl <- lm(y ~ bs(x, knots = x_unique[g4@bestsol], Boundary.knots = range(x))) plot(x, y, xlab = "Time (ms)", ylab = "Acceleration (g)") ht <- seq(min(x), max(x), length.out = 200) lines(ht, predict(bsfit.vary.ga, data.frame(x = ht)), col = "blue", lty = 5, lwd = 2) lines(ht, predict(bsfit.vary.gaisl, data.frame(x = ht)), col = "orange", lty = 4, lwd = 2) lines(ht, predict(bsfit.fix.3.ga, data.frame(x = ht)), col = "purple", lty = 3, lwd = 2) lines(ht, predict(bsfit.fix.4.gaisl, data.frame(x = ht)), col = "#D55E00", lty = 2, lwd = 2) legend("bottomright", legend = c( "Varying knots GA", "Varying knots island model GA", "Fixed 3 knots GA", "Fixed 4 knots island model GA" ), lty = 5:2, lwd = 2, col = c("blue", "orange", "purple", "#D55E00"), bty = "n" ) ``` ### Spline options: piecewise polynomials, natural cubic, and B-splines This section illustrates how to build spline design matrices via \code{splineX()} for three common options: - \code{type="ppolys"}: degree-$d$ truncated power piecewise polynomials; - \code{type="ns"}: degree-3 natural cubic spline (degree will be ignored); - \code{type="bs"}: degree-$d$ B-spline basis; We’ll use the motorcycle acceleration data \code{MASS::mcycle}, create interior knots at quantiles of \code{times}, and compare how different spline types/degrees behave. Here, we only illustrate through `Varying number and locations of knots` set-up (Let GA choose both **how many** knots and **where** they go). ```{r vary-ppolys} g_pp3 <- gareg_knots( y = y, x = x, minDist = 5, gaMethod = "cptga", ObjFunc = NULL, # use default varyknotsIC type = "ppolys", degree = 3, # degree-3 piecewise polynomial intercept = TRUE, ic_method = "BIC" ) summary(g_pp3) ``` ```{r vary-ns} g_ns <- gareg_knots( y = y, x = x, minDist = 5, gaMethod = "cptga", type = "ns", # natural cubic (degree ignored) degree = 3, # ignored for "ns" intercept = TRUE, ic_method = "BIC" ) summary(g_ns) ``` ```{r vary-bs} g_bs1 <- gareg_knots( y = y, x = x, minDist = 5, gaMethod = "cptga", type = "bs", degree = 1, # linear B-splines intercept = TRUE, ic_method = "BIC" ) summary(g_bs1) ``` ## Reproducibility ```{r} sessionInfo() ``` ## References Li, M., & Lu, Q. (2025). changepointGA: An R package for Fast Changepoint Detection via Genetic Algorithm. arXiv preprint arXiv:2410.15571. Mo Li, QiQi Lu, Robert Lund, & Xueheng Shi. (2026). Genetic Algorithms in Regression. In preparation. Venables, W. N. & Ripley, B. D. (2002) Modern Applied Statistics with S. Fourth Edition. Springer, New York. ISBN 0-387-95457-0.