Calculating residual spatial autocorrelation

Perhaps the most famous sentence in spatial analysis is Tobler’s first law of geography, from Tobler (1970): “Everything is related to everything else, but near things are more related than distant things.” Spatial data often exhibits spatial autocorrelation, where variables of interest are not distributed at random but rather exhibit spatial patterns; in particular, spatial data is often clustered (exhibiting positive spatial autocorrelation) such that locations near each other are more similar than you’d expect if you had just sampled two observations at random.

For some data, this makes intuitive sense. The elevation at two neighboring points is extremely likely to be similar, as is the precipitation and temperature; these are variables whose values depend on (among other things) your position on the Earth. However, the first law is often over-interpreted. Pebesma and Bivand (2022) present an interesting discussion of the “first law”, quoting Olsson (1970) who says:

[T]he fact that the autocorrelations seem to hide systematic specification errors suggests that the elevation of this statement to the status of ‘the first law of geography’ is at best premature. At worst, the statement may represent the spatial variant of the post hoc fallacy, which would mean that coincidence has been mistaken for a causal relation.

Oftentimes, finding spatial autocorrelation in a variable is a result of that variable depending on other variables, which may or may not be spatially dependent themselves. For instance, house prices often exhibit positive autocorrelation, not because home prices are determined by their relative position on Earth, but because house prices rely upon other variables – school zones, median income, housing availability and more – which may themselves be spatially autocorrelated.

For that reason, it’s often worthwhile to look at the spatial autocorrelation of model residuals, to see if your model makes more errors in certain regions than you’d expect if errors were randomly arranged. That can help you to identify misspecifications in your model: seeing large autocorrelations in model residuals in an area might suggest that you’re missing variables in your model, and knowing which areas your model does worse in can help you to identify what those variables might be. Even if you can’t fix your model, it’s often useful to identify regions your model does notably worse in, so that you can communicate that to whoever winds up using your predictions.

Let’s walk through how we can use waywiser to find local indicators of spatial autocorrelation for a very simple model. First things first, let’s load a few libraries:

# waywiser itself, of course:
library(waywiser)
# For the %>% pipe and mutate:
library(dplyr)

We’ll be working with the guerry data included in waywiser package. We’ll fit a simple linear model relating crimes against persons with literacy, and then generate predictions from that model. We can use ww_local_moran_i() to calculate the local spatial autocorrelation of our residuals at each data point:

guerry %>%
  mutate(pred = predict(lm(Crm_prs ~ Litercy, .))) %>%
  ww_local_moran_i(Crm_prs, pred)
#> # A tibble: 85 × 3
#>    .metric       .estimator .estimate
#>    <chr>         <chr>          <dbl>
#>  1 local_moran_i standard      0.530 
#>  2 local_moran_i standard      0.858 
#>  3 local_moran_i standard      0.759 
#>  4 local_moran_i standard      0.732 
#>  5 local_moran_i standard      0.207 
#>  6 local_moran_i standard      0.860 
#>  7 local_moran_i standard      0.692 
#>  8 local_moran_i standard      1.69  
#>  9 local_moran_i standard     -0.0109
#> 10 local_moran_i standard      0.710 
#> # ℹ 75 more rows

If you’re familiar with spdep, you can probably guess that waywiser is doing something under the hood here to calculate which of our observations are neighbors, and how to create spatial weights from those neighborhoods. And that guess would be right – waywiser is making use of two functions, ww_build_neighbors() and ww_build_weights(), in order to automatically calculate spatial weights for calculating metrics:

ww_build_neighbors(guerry)
#> Neighbour list object:
#> Number of regions: 85 
#> Number of nonzero links: 420 
#> Percentage nonzero weights: 5.813149 
#> Average number of links: 4.941176

ww_build_weights(guerry)
#> Characteristics of weights list object:
#> Neighbour list object:
#> Number of regions: 85 
#> Number of nonzero links: 420 
#> Percentage nonzero weights: 5.813149 
#> Average number of links: 4.941176 
#> 
#> Weights style: W 
#> Weights constants summary:
#>    n   nn S0      S1       S2
#> W 85 7225 85 37.2761 347.6683

These functions aren’t always the best way to calculate spatial weights for your data, however. As a result, waywiser also lets you specify your own weights directly:

weights <- guerry %>%
  sf::st_geometry() %>%
  sf::st_centroid() %>%
  spdep::dnearneigh(0, 97000) %>%
  spdep::nb2listw()

weights
#> Characteristics of weights list object:
#> Neighbour list object:
#> Number of regions: 85 
#> Number of nonzero links: 314 
#> Percentage nonzero weights: 4.346021 
#> Average number of links: 3.694118 
#> 
#> Weights style: W 
#> Weights constants summary:
#>    n   nn S0       S1       S2
#> W 85 7225 85 51.86738 348.7071

guerry %>%
  mutate(pred = predict(lm(Crm_prs ~ Litercy, .))) %>%
  ww_local_moran_i(Crm_prs, pred, weights)
#> # A tibble: 85 × 3
#>    .metric       .estimator .estimate
#>    <chr>         <chr>          <dbl>
#>  1 local_moran_i standard    0.530   
#>  2 local_moran_i standard    0.794   
#>  3 local_moran_i standard    0.646   
#>  4 local_moran_i standard    0.687   
#>  5 local_moran_i standard    0.207   
#>  6 local_moran_i standard    1.49    
#>  7 local_moran_i standard    0.692   
#>  8 local_moran_i standard    1.69    
#>  9 local_moran_i standard   -0.000610
#> 10 local_moran_i standard    0.859   
#> # ℹ 75 more rows

Or as a function, which lets you use custom weights with other tidymodels functions like fit_resamples():

weights_function <- function(data) {
  data %>%
    sf::st_geometry() %>%
    sf::st_centroid() %>%
    spdep::dnearneigh(0, 97000) %>%
    spdep::nb2listw()
}

guerry %>%
  mutate(pred = predict(lm(Crm_prs ~ Litercy, .))) %>%
  ww_local_moran_i(Crm_prs, pred, weights_function)
#> # A tibble: 85 × 3
#>    .metric       .estimator .estimate
#>    <chr>         <chr>          <dbl>
#>  1 local_moran_i standard    0.530   
#>  2 local_moran_i standard    0.794   
#>  3 local_moran_i standard    0.646   
#>  4 local_moran_i standard    0.687   
#>  5 local_moran_i standard    0.207   
#>  6 local_moran_i standard    1.49    
#>  7 local_moran_i standard    0.692   
#>  8 local_moran_i standard    1.69    
#>  9 local_moran_i standard   -0.000610
#> 10 local_moran_i standard    0.859   
#> # ℹ 75 more rows

Providing custom weights also lets us use ww_local_moran_i_vec() to add a column to our original data frame with our statistic, which makes plotting using our original geometries easier:

library(ggplot2)

weights <- ww_build_weights(guerry)

guerry %>%
  mutate(
    pred = predict(lm(Crm_prs ~ Litercy, .)),
    .estimate = ww_local_moran_i_vec(Crm_prs, pred, weights)
  ) %>%
  sf::st_as_sf() %>%
  ggplot(aes(fill = .estimate)) +
  geom_sf() +
  scale_fill_gradient2(
    "Local Moran",
    low = "#018571",
    mid = "white",
    high = "#A6611A"
  )

This makes it easy to see what areas are poorly represented by our model (which have the highest local Moran values), which might lead us to identify ways to improve our model or help us identify caveats and limitations of the models we’re working with.

Other functions in waywiser will allow you to calculate the p-value associated with spatial autocorrelation metrics. You can calculate these alongside the autocorrelation metrics themselves using yardstick::metric_set():

moran <- yardstick::metric_set(
  ww_global_moran_i,
  ww_global_moran_pvalue
)

guerry %>%
  mutate(pred = predict(lm(Crm_prs ~ Litercy, .))) %>%
  moran(Crm_prs, pred)
#> # A tibble: 2 × 3
#>   .metric             .estimator .estimate
#>   <chr>               <chr>          <dbl>
#> 1 global_moran_i      standard    4.12e- 1
#> 2 global_moran_pvalue standard    7.23e-10

These functions can also be used on their own to help qualitatively identify regions of concern, which may be poorly represented by your model:

guerry %>%
  mutate(
    pred = predict(lm(Crm_prs ~ Litercy, .)),
    .estimate = ww_local_moran_pvalue_vec(Crm_prs, pred, weights)
  ) %>%
  sf::st_as_sf() %>%
  ggplot(aes(fill = .estimate < 0.01)) +
  geom_sf() +
  scale_fill_discrete("Local Moran p-value < 0.01?") +
  theme(legend.position = "bottom")

This can help identify new predictor variables or other promising refinements to a model during the iterative process of model development. You shouldn’t report p-values without other context as results of your model, but this approach can help qualitatively assess a model during the development process. To use these tests for inference, consider using functions from spdep directly; each autocorrelation function in waywiser links to the spdep function it wraps from its documentation.