--- title: "Bandwidth Selection and Conley Spatial HAC Standard Errors" author: "Alexander Lehner" date: "`r Sys.Date()`" output: rmarkdown::html_vignette bibliography: references.bib vignette: > %\VignetteIndexEntry{spatial_HAC} %\VignetteEngine{knitr::knitr} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, warning = FALSE, comment = "#>" ) ``` ## Overview The `SpatialInference` package provides tools for statistical inference with spatially correlated data. Its main features are: 1. **Conley spatial HAC standard errors** — a computationally efficient implementation [based on C++ code by Darin Christensen](https://github.com/darinchristensen/conley-se) [@ChristensenHartmanSamii2021] of the estimator introduced by @Conley1999. 2. **Covariogram-based bandwidth selection** — a data-driven method for choosing the kernel bandwidth, based on extracting the correlation range from the empirical covariogram of regression residuals, as proposed by @Lehner2026. 3. **Diagnostic visualizations** — including the inverse-U relationship between bandwidth and Conley SE magnitude [@Lehner2026], and covariogram plots with annotated range estimates. ## Setup Load the package and data containing the centroids of all 3,108 counties of the contiguous United States: ```{r setup, fig.width=7} library(SpatialInference) library(sf); library(ggplot2) library(lfe) library(modelsummary) data("US_counties_centroids") ggplot(US_counties_centroids) + geom_sf(size = .1) + theme_bw() ``` ## Simulated spatially correlated data The dataset contains two independently generated spatially correlated noise variables (`noise1` and `noise2`), drawn using geostatistical simulation. 
Although drawn independently, both exhibit strong spatial clustering — nearby counties have similar values:

```{r, fig.show='hold', fig.width=7}
ggplot(US_counties_centroids) + geom_sf(aes(col = noise1), size = .1) + theme_bw()
ggplot(US_counties_centroids) + geom_sf(aes(col = noise2), size = .1) + theme_bw()
```

For comparison, the `dist` variable is an example of spatial non-stationarity (a spatial trend rather than spatial autocorrelation):

```{r}
ggplot(US_counties_centroids) + geom_sf(aes(col = dist), size = .1) + theme_bw()
```

## The spurious regression problem

Even though the two noise variables were drawn independently (so the true coefficient is zero), a naive regression finds a highly significant relationship:

```{r}
spuriouslm <- fixest::feols(noise1 ~ noise2, data = US_counties_centroids, vcov = "HC1")
spuriouslm
```

This is because the OLS errors $\varepsilon_i$ are not i.i.d. — nearby residuals are positively correlated, which inflates the $t$-statistic. A map of the regression residuals reveals the spatial clustering:

```{r}
US_counties_centroids$resid <- spuriouslm$residuals
ggplot(US_counties_centroids) +
  geom_sf(aes(col = resid), size = .1) +
  theme_bw() +
  scale_color_viridis_c()
```

## Estimating the correlation range

Before computing Conley standard errors, we need to choose a bandwidth: the distance within which residuals are allowed to be correlated. The `covgm_range()` function estimates this from the data by computing the empirical covariogram of the regression residuals and identifying the distance at which covariation first crosses zero [@Lehner2026; @Pebesma2004].

```{r, fig.width=7}
covgm_range(US_counties_centroids) + theme_bw()
```

The plot shows the empirical covariogram — each point represents the average cross-product of residual pairs within a distance bin. The red vertical line marks the estimated correlation range $\hat{\varsigma}$ (here around 831 km), and the red dashed line marks zero covariance.
Beyond the estimated range, the covariogram fluctuates around zero, indicating that residual correlation has effectively ceased. This is the key input for the Conley standard error: the bandwidth should match the distance over which residuals are actually correlated.

To extract just the numeric range for programmatic use:

```{r}
# Compute the covariogram manually
covgm <- gstat::variogram(
  spuriouslm$residuals ~ 1,
  data = sf::as_Spatial(US_counties_centroids),
  covariogram = TRUE,
  width = 2e4,
  cutoff = as.numeric(max(sf::st_distance(US_counties_centroids))) * (2/3)
)

estimated_range <- extract_corr_range(covgm)
estimated_range
```

## The inverse-U relationship

A critical insight documented by @Lehner2026 is that the relationship between the bandwidth and the magnitude of Conley standard errors follows an *inverse-U shape*. This means that both too narrow and too wide bandwidths lead to underestimated standard errors — contradicting the conventional wisdom that wider bandwidths are always more conservative. The `inverseu_plot_conleyrange()` function visualises this:

```{r, fig.width=7}
inverseu_plot_conleyrange(US_counties_centroids, seq(1, 2501, by = 200)) + theme_bw()
```

The grey dashed line represents the HC1 standard error. As the cutoff distance increases from zero, the Conley SE first *rises* (as positively correlated residual pairs are included), reaches a peak near the true correlation range, and then *declines* (as uncorrelated pairs dilute the estimate). At very wide bandwidths, the Conley SE can fall *below* the HC1 level — making wide-bandwidth Conley errors less conservative than heteroskedasticity-robust errors.
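The mechanism behind the inverse-U can be reproduced in a few lines of base R. For a demeaned residual series that is correlated only over short distances, a Bartlett-weighted HAC sum first grows with the cutoff (positively correlated pairs enter) and then shrinks (since the residuals sum to zero, very wide windows push the weighted sum toward zero). This one-dimensional toy example is purely illustrative and not part of the package:

```r
set.seed(1)
n   <- 300
loc <- 1:n

# Residuals correlated up to roughly 10 units, via a moving-average construction
e <- as.numeric(stats::filter(rnorm(n + 20), rep(1, 10), sides = 1))[21:(n + 20)]
e <- e - mean(e)

d <- abs(outer(loc, loc, "-"))  # pairwise distances

# Conley-style variance of the sample mean with a Bartlett kernel
hac_var <- function(cutoff) {
  w <- pmax(1 - d / cutoff, 0)
  sum(w * outer(e, e)) / n^2
}

cutoffs <- c(1, 5, 10, 20, 50, 100, 200)
round(sapply(cutoffs, hac_var), 4)
```

A cutoff of 1 keeps only the diagonal (the heteroskedasticity-only case); cutoffs near the true correlation range give the largest, most honest variance estimate; very large cutoffs deflate it again.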
## Computing Conley standard errors

With the estimated range in hand, we compute the spatial HAC standard error using `conley_SE()`:

```{r}
spuriouslm_fe <- felm(noise1 ~ noise2 | unit + year | 0 | lat + lon,
                      data = US_counties_centroids, keepCX = TRUE)

regfe_conley <- conley_SE(reg = spuriouslm_fe,
                          unit = "unit", time = "year",
                          lat = "lat", lon = "lon",
                          kernel = "epanechnikov",
                          dist_fn = "Haversine",
                          dist_cutoff = 831)

# Standard errors are the square roots of the diagonal of the vcov matrix
conley <- round(sapply(regfe_conley, function(x) sqrt(diag(x)))[2], 5)
conley
```

The Conley SE at the estimated bandwidth is substantially larger than the HC1 error. Constructing a corrected $t$-statistic:

```{r}
spuriouslm_fe$coefficients[1] / as.numeric(conley)
```

The $t$-statistic is far below 1.96, so we correctly *fail to reject* the null hypothesis — as expected, given that the two variables are genuinely independent.

## Convenience wrappers

The `compute_conley_lfe()` function provides a shortcut:

```{r}
compute_conley_lfe(spuriouslm_fe, cutoff = 831, kernel_choice = "epanechnikov")
```

For a complete workflow — regression, Moran's I test, and Conley SEs in one call — use `lm_sac()`:

```{r}
lmsac.out <- lm_sac("noise1 ~ noise2 | unit + year | 0 | lat + lon",
                    US_counties_centroids,
                    conley_cutoff = 831, conley_kernel = "epanechnikov")
lmsac.out$conley_SE

gm.param <- list(
  list("raw" = "nobs",        "clean" = "Observations",      "fmt" = 0),
  list("raw" = "Moran_y",     "clean" = "Moran's I [y]",     "fmt" = 3),
  list("raw" = "Moran_resid", "clean" = "Moran's I [resid]", "fmt" = 3),
  list("raw" = "y_mean",      "clean" = "Mean [y]",          "fmt" = 3),
  list("raw" = "y_SD",        "clean" = "Std.Dev. [y]",      "fmt" = 3)
)

modelsummary(lmsac.out,
             estimate = c("{estimate}"),
             statistic = c("({std.error})", "([{conley}])"),
             gof_omit = "Model|Range_resid|Range_resp|Range_y",
             gof_map = gm.param)
```

## Kernel choices

The spatial HAC estimate is sensitive not only to the bandwidth but also to the kernel function. The package includes six kernels with different distance-decay profiles.
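The six decay profiles can be sketched as functions of distance `d` and the cutoff. These are the standard unnormalized textbook forms; the exact expressions implemented inside `compute_conley_lfe()` may differ, for instance in normalization constants or in how the Gaussian kernel is truncated:

```r
kernels <- list(
  uniform      = function(d, cutoff) as.numeric(d <= cutoff),
  bartlett     = function(d, cutoff) pmax(1 - d / cutoff, 0),
  epanechnikov = function(d, cutoff) pmax(1 - (d / cutoff)^2, 0),
  biweight     = function(d, cutoff) pmax(1 - (d / cutoff)^2, 0)^2,
  parzen       = function(d, cutoff) {
    u <- pmin(d / cutoff, 1)
    ifelse(u <= 0.5, 1 - 6 * u^2 + 6 * u^3, 2 * (1 - u)^3)
  },
  gaussian     = function(d, cutoff) exp(-0.5 * (d / cutoff)^2)
)

# Weight given to a residual pair 500 km apart under an 831 km cutoff
sapply(kernels, function(k) round(k(500, 831), 3))
```

All compact-support kernels assign weight 1 at distance zero and weight 0 at the cutoff; they differ only in how quickly the weight decays in between.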
The uniform kernel weights all pairs equally within the cutoff; the others downweight more distant pairs with varying decay rates:

```{r}
compute_conley_lfe(spuriouslm_fe, cutoff = 831, kernel_choice = "uniform")
compute_conley_lfe(spuriouslm_fe, cutoff = 831, kernel_choice = "epanechnikov")
compute_conley_lfe(spuriouslm_fe, cutoff = 831, kernel_choice = "bartlett")
compute_conley_lfe(spuriouslm_fe, cutoff = 831, kernel_choice = "parzen")
compute_conley_lfe(spuriouslm_fe, cutoff = 831, kernel_choice = "gaussian")
compute_conley_lfe(spuriouslm_fe, cutoff = 831, kernel_choice = "biweight")
```

Based on Monte Carlo evidence in @Lehner2026, the Bartlett and Epanechnikov kernels tend to deliver the best size control in practice.

## References