--- title: "Introduction to gkrreg: Gaussian Kernel Robust Regression" author: "Eufrásio de Andrade Lima Neto, Marcelo Rodrigo Portela Ferreira" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Introduction to gkrreg: Gaussian Kernel Robust Regression} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} bibliography: references.bib --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 6, fig.height = 4, fig.align = "center" ) set.seed(1) ``` ## Overview The **gkrreg** package implements the *Gaussian Kernel Robust Regression* (GKRReg, or GKRR) method proposed by @decarvalho2017. The method fits a linear regression model by iteratively re-weighting observations using the Gaussian kernel, so that outliers and leverage points are automatically down-weighted. Convergence is theoretically guaranteed (Propositions 4.1 and 4.2 of the paper). The package provides: - `gkrr()` — model fitting with four width hyper-parameter estimators (`"s1"`, `"s2"`, `"s3"`, `"s4"`) plus automatic data-driven selection (`"auto"`); - `gkrr_boot()` — bootstrap inference (standard errors, confidence intervals and p-values); - `summary.gkrr()` — a coefficient table modelled after `summary.lm()`, with sandwich inference by default and bootstrap inference when available; - `plot.gkrr()` — six diagnostic panels emphasising kernel weights; - Six real datasets from the robust regression literature. --- ## The GKRR method ### Model and objective function GKRReg minimises $$S(\boldsymbol{\beta}) = 2\sum_{i=1}^n \bigl[1 - G(y_i,\, \hat{\mu}_i)\bigr], \qquad G(a,b) = \exp\!\bigl(-{(a-b)^2}/{\gamma^2}\bigr),$$ where $\hat{\mu}_i = \mathbf{x}_i^\top\boldsymbol{\beta}$ and $\gamma^2 > 0$ is the kernel width hyper-parameter. An observation with a large residual contributes close to 2 to $S$; a perfectly fitted observation contributes 0. ### Estimation algorithm Differentiating $S$ yields an IRWLS problem with kernel weights $k_{ii}^{(t)} = G(y_i, \hat{\mu}_i^{(t-1)}) \in (0,1]$. The algorithm starts from OLS and alternates between WLS and weight updates until convergence (Propositions 4.1–4.2 of @decarvalho2017). ### Width hyper-parameter $\gamma^2$ Five options are available via the `sigma_method` argument: | Method | Description | Recommended scenario | |--------|-------------|----------------------| | `"s1"` | Mean of the 0.1 and 0.9 quantiles of OLS squared residuals on a sub-sample (Caputo estimator) | Clean data | | `"s2"` | Pairwise median of $(y_i - \hat{\mu}_j^{\text{OLS}})^2$, $i \neq j$ | Y-space outliers $\leq 10\%$; X-space outliers $\leq 15\%$ | | `"s3"` | Residual variance $\sum(y_i - \hat{\mu}_i)^2 / (n - p - 1)$ | Y-space outliers $\geq 15\%$; leverage points | | `"s4"` | AICc-selected bandwidth $h^2$ via `sm::h.select()` | Large samples ($n \geq 200$) | | `"auto"` | Selects among S1–S4 by out-of-bag bootstrap MSE | When the outlier scenario is unknown | When `"auto"` is used, a message is emitted showing which method was selected and the comparative OOB MSEs. The selection uses $B = 99$ replicates by default, configurable via `auto_args = list(B = ...)`. --- ## Basic usage ```{r basic} library(gkrreg) data(belgium_calls) fit <- gkrr(calls ~ year, data = belgium_calls, sigma_method = "s3") print(fit) ``` `summary()` uses the **sandwich variance estimator** by default, producing standard errors, confidence intervals and Wald z-test p-values with no additional computation: ```{r summary-sandwich} summary(fit) ``` The sandwich covariance matrix is accessible via `vcov()`: ```{r vcov} round(vcov(fit), 6) ``` --- ## Statistical inference ### Sandwich variance estimator (default) The original paper does not provide a sampling distribution for $\hat{\boldsymbol{\beta}}$. The WLS covariance matrix $(X^\top \hat{K} X)^{-1}$ from the final IRWLS step treats $\hat{K}$ as fixed and therefore underestimates the true variance. **gkrreg** implements an analytic *sandwich variance estimator* based on the theory of generalised $M$-estimators. At convergence the GKRReg estimating equations are: $$\sum_{i=1}^n k_{ii}(\boldsymbol{\beta})\,\mathbf{x}_i (y_i - \mathbf{x}_i^\top\boldsymbol{\beta}) = \mathbf{0},$$ and the asymptotic sandwich covariance is $n^{-1}\hat{A}^{-1}\hat{B}\hat{A}^{-1}$, where $$\hat{A} = \frac{1}{n}\,\mathbf{X}^\top \operatorname{diag}\!\left[k_{ii}\!\left(1 - \frac{2\hat{e}_i^2}{\hat{\gamma}^2}\right)\right] \mathbf{X}, \qquad \hat{B} = \frac{1}{n}\,\mathbf{X}^\top \operatorname{diag}\!\left(k_{ii}^2\,\hat{e}_i^2\right)\mathbf{X}.$$ This corresponds to the HC0 class of heteroskedasticity-robust covariance estimators [@white1982]. The correction term $(1 - 2\hat{e}_i^2/\hat{\gamma}^2)$ in $\hat{A}$ arises directly from differentiating the Gaussian kernel with respect to $\boldsymbol{\beta}$. **When is the sandwich reliable?** The sandwich is consistent and efficient asymptotically, but treats $\hat{\gamma}^2$ as fixed. For small samples ($n < 50$) or heavy contamination, bootstrap inference is recommended. `summary()` emits a note when these conditions are detected automatically. ### Bootstrap inference `gkrr_boot()` runs a *pairs bootstrap*, re-fitting the complete GKRReg algorithm — including re-estimating $\gamma^2$ — on each replicate. This captures all sources of variability and is more reliable than the sandwich for small samples or heavy contamination. Three CI types are available: `"percentile"`, `"normal"`, and `"bca"` (bias-corrected and accelerated, @efron1987; the default and recommended). Bootstrap p-values use the centred-t approach: $$p_j = \frac{2}{B}\,\#\!\left\{|\hat\beta_j^* - \hat\beta_j| \geq |\hat\beta_j|\right\}.$$ **Option A — `boot = TRUE` inside `gkrr()`** ```{r boot-true, eval = FALSE} fit_b <- gkrr(calls ~ year, data = belgium_calls, sigma_method = "s3", boot = TRUE, boot_args = list(B = 999, type = "bca", seed = 1)) summary(fit_b) ``` **Option B — separate `gkrr_boot()` call** ```{r boot-separate, eval = FALSE} boot <- gkrr_boot(fit, B = 999, type = "bca", seed = 1) summary(fit, boot = boot) plot(boot, which = 1) # histogram + shaded CI per coefficient plot(boot, which = 2) # bootstrap scatter-plot matrix ``` ### Choosing between sandwich and bootstrap | Scenario | Recommended | |----------|-------------| | $n \geq 50$, mild contamination | Sandwich (fast, deterministic) | | $n < 50$ or heavy contamination | Bootstrap | | Both available and SEs diverge by > 25% | Bootstrap | When both are available, `summary()` automatically warns if the relative discrepancy in standard errors exceeds 25% for any coefficient. --- ## Diagnostic plots `plot.gkrr()` provides six panels. **Point size is inversely proportional to $k_{ii}$**: outliers appear large and red; well-fitted observations appear small and blue. ```{r plots, fig.width = 7, fig.height = 3} oldpar <- par(no.readonly = TRUE) on.exit(par(oldpar)) par(mfrow = c(1, 3)) plot(fit, which = 1, ask = FALSE) # residuals vs. fitted plot(fit, which = 3, ask = FALSE) # weight vs. residual + kernel curve plot(fit, which = 4, ask = FALSE) # weight vs. index par(oldpar) ``` Panel 3 overlays the theoretical curve $G(e) = \exp(-e^2/\hat{\gamma}^2)$, making the down-weighting mechanism directly visible. Observations 15–20 in `belgium_calls` (a recording error) receive near-zero weights and are effectively excluded from the fit. | `which` | Description | |---------|-------------| | 1 | Residuals vs. fitted values | | 2 | Observed vs. fitted ($y$ vs. $\hat{\mu}$) | | 3 | Kernel weight vs. residual with theoretical curve | | 4 | Kernel weight vs. observation index | | 5 | Normal QQ-plot of residuals, coloured by weight | | 6 | Convergence of $S(\boldsymbol{\beta})$ | --- ## Comparison with other methods ```{r comparison, message = FALSE, fig.width = 6, fig.height = 4} data(kootenay) fit_ols <- lm(newgate ~ libby, data = kootenay) fit_gkrr <- gkrr(newgate ~ libby, data = kootenay, sigma_method = "s1") if (requireNamespace("MASS", quietly = TRUE)) { fit_m <- MASS::rlm(newgate ~ libby, data = kootenay, method = "M") fit_mm <- MASS::rlm(newgate ~ libby, data = kootenay, method = "MM") } else { fit_m <- fit_mm <- NULL } tab <- rbind(OLS = coef(fit_ols), GKRR = coef(fit_gkrr)) if (!is.null(fit_m)) tab <- rbind(tab, M = coef(fit_m), MM = coef(fit_mm)) print(round(tab, 4)) plot(kootenay$libby, kootenay$newgate, xlab = "Libby flow", ylab = "Newgate flow", main = "Kootenay River -- X-space outlier (1934)", pch = 19, col = "grey60") points(kootenay["1934","libby"], kootenay["1934","newgate"], col = "red", pch = 17, cex = 1.6) abline(fit_ols, col = "black", lwd = 2, lty = 2) abline(fit_gkrr, col = "firebrick", lwd = 2) if (!is.null(fit_m)) { abline(fit_m, col = "darkorange", lwd = 2, lty = 3) abline(fit_mm, col = "purple", lwd = 2, lty = 4) legend("topleft", c("OLS","GKRR","M","MM","1934"), col = c("black","firebrick","darkorange","purple","red"), lty = c(2,1,3,4,NA), pch = c(NA,NA,NA,NA,17), lwd = 2, bty = "n") } else { legend("topleft", c("OLS","GKRR","1934"), col = c("black","firebrick","red"), lty = c(2,1,NA), pch = c(NA,NA,17), lwd = 2, bty = "n") } ``` --- ## Real-data applications ### Belgium international calls — Y-space outliers ```{r belgium, fig.width = 6, fig.height = 4} fit_ols <- lm(calls ~ year, data = belgium_calls) fit_gkrr <- gkrr(calls ~ year, data = belgium_calls, sigma_method = "s3") plot(belgium_calls$year + 1900, belgium_calls$calls, xlab = "Year", ylab = "Calls (tens of millions)", main = "Belgium International Calls", pch = 19, col = "grey60") points(belgium_calls$year[15:20] + 1900, belgium_calls$calls[15:20], col = "red", pch = 17, cex = 1.4) abline(fit_ols, col = "black", lwd = 2, lty = 2) abline(fit_gkrr, col = "firebrick", lwd = 2) legend("topleft", c("OLS","GKRR (S3)","Outliers (1964-69)"), col = c("black","firebrick","red"), lty = c(2,1,NA), pch = c(NA,NA,17), lwd = 2, bty = "n") ``` ```{r belgium-diag, fig.width = 7, fig.height = 3} oldpar <- par(no.readonly = TRUE) on.exit(par(oldpar)) par(mfrow = c(1, 3)) plot(fit_gkrr, which = 1, ask = FALSE) plot(fit_gkrr, which = 3, ask = FALSE) plot(fit_gkrr, which = 4, ask = FALSE) par(oldpar) ``` ### Delivery time — leverage points in multiple regression ```{r delivery} data(delivery) fit_ols <- lm(delivery_time ~ n_products + distance, data = delivery) fit_gkrr <- gkrr(delivery_time ~ n_products + distance, data = delivery, sigma_method = "s3") round(rbind(OLS = coef(fit_ols), GKRR = coef(fit_gkrr)), 4) ``` ```{r delivery-diag, fig.width = 7, fig.height = 3} oldpar <- par(no.readonly = TRUE) on.exit(par(oldpar)) par(mfrow = c(1, 3)) plot(fit_gkrr, which = 2, ask = FALSE) plot(fit_gkrr, which = 4, ask = FALSE) plot(fit_gkrr, which = 5, ask = FALSE) par(oldpar) ``` ### Mammals — leverage on the log scale ```{r mammals, fig.width = 6, fig.height = 4} data(mammals) fit_ols <- lm(log_brain ~ log_body, data = mammals) fit_gkrr <- gkrr(log_brain ~ log_body, data = mammals, sigma_method = "s3") plot(mammals$log_body, mammals$log_brain, xlab = "log(body mass, kg)", ylab = "log(brain mass, g)", main = "Brain vs. Body Mass (62 Mammal Species)", pch = 19, col = "grey60", cex = 0.8) elephants <- mammals$species %in% c("African elephant","Asian elephant") points(mammals$log_body[elephants], mammals$log_brain[elephants], col = "red", pch = 17, cex = 1.5) abline(fit_ols, col = "black", lwd = 2, lty = 2) abline(fit_gkrr, col = "firebrick", lwd = 2) legend("topleft", c("OLS","GKRR (S3)","Elephants"), col = c("black","firebrick","red"), lty = c(2,1,NA), pch = c(NA,NA,17), lwd = 2, bty = "n") ``` --- ## Session information ```{r session} sessionInfo() ``` ## References