smerc-demo

Joshua P. French

2023-10-08

Introduction

The smerc package is focused on Statistical Methods for Regional Counts. It particularly focuses on the spatial scan method (Kulldorff, 1997) and many of its extensions.

In what follows, we demonstrate some of the basic functionality of the smerc package using 281 observations related to leukeumia cases in an 8 county area of the state of New York. The data were made available in Waller and Gotway (2005) and details are provided there.

The leukemia data are available in the nydf dataframe. Each row of the dataframe contains information regarding a different region. Each row of the dataframe includes longitude and latitude information for the centroid of each region, transformed x and y coordinates available in the original dataset provided by Waller and Gotway (2005), as well as the population of each region and the number of leukemia cases.

library(smerc) # load package
## # This research was partially supported under NSF grants 1463642 and 1915277
data(nydf) # load data
str(nydf) # look at structure
## 'data.frame':    281 obs. of  6 variables:
##  $ longitude : num  -75.9 -75.9 -75.9 -75.9 -75.9 ...
##  $ latitude  : num  42.1 42.1 42.1 42.1 42.1 ...
##  $ population: num  3540 3560 3739 2784 2571 ...
##  $ cases     : num  3.08 4.08 1.09 1.07 3.06 ...
##  $ x         : num  4.07 4.64 5.71 7.61 7.32 ...
##  $ y         : num  -67.4 -66.9 -67 -66 -67.3 ...

We plot the study area below using information in the data object.

data(nysf) # load nysf data
library(sf) # load sf package for plotting
## Linking to GEOS 3.11.2, GDAL 3.6.2, PROJ 9.2.0; sf_use_s2() is TRUE
plot(st_geometry(nysf)) # plot study area

Scan methods

The spatial scan test can be perfomed using the scan.test function. The user must supply the coordinates of the region centroids, the number of cases in each region, and the population (i.e., the number persons at risk). There are many additional optional arguments described in the documentation. In the following example, we reduce the number of Monte Carlo simulations used to estimate the p-value.

coords = nydf[,c("x", "y")] # extract coordinates
cases = nydf$cases # extract cases
pop = nydf$population # extract population
scan_out = scan.test(coords, cases, pop, nsim = 99) # perform scan test
## computing statistics for simulated data:

The scan.test function (and most of the other *.test functions in the smerc package) will return an object of class smerc_cluster.

class(scan_out)
## [1] "smerc_cluster"

A smerc_cluster object has a default print option that summarizes relevant details about the object produced by the function.

In this case, we receive the following:

scan_out # print scan.test results
## method: circular scan
## statistic: poisson
## simulation: multinomial
## realizations: 99
## population upperbound: 50%
## minimum cases: 2

In this particular case, we see that the scan_out object summarizes the results of the circular scan method using a poisson statistic (the other choice is binomial). The Monte Carlo p-value was estimated using 99 realizations from a multinomial distribution (poisson and binomial are the other possibilities). The candidate zones were limited to having no more than 50% of the total population and at least 2 cases had to be in a candidate zone to be considered a cluster. Other methods will provide some of the same information, but some of the information will be different based on the relevant parameters of the models.

A smerc_cluster also has a summary function that summarizes the significant (or most likely) clusters returned by the method.

summary(scan_out) # summarize scan.test results
##   nregions max_dist    cases   ex  rr stat    p
## 1       24     12.3 95.33108 55.8 1.8 13.1 0.01
## 2       11     26.5 49.71990 27.1 1.9  8.0 0.03

The summary of scan_out reveals that there were two significant clusters. The first cluster was comprised of 24 regions with a maximum distance of 12.3 between centroids. The number of cases observed in the cluster is 95.33, while 55.8 cases were expected. The relative risk of the cluster is 1.8, with an associated test statistic of 13.1, and a Monte Carlo p-value of 0.01. A second cluster is also summarized with similar information.

A very basic plot mechanism is provided for smerc_cluster objects. The plot will show the locations of every centroid, with significant clusters shown with different colors and symbols. An attempt is also made to show the connectivity of the regions. In the plot below, we see the two significant clusters shown in yellow (middle) and purple (bottom).

plot(scan_out) # basic plot of scan.test results

If you want to extract the detected clusters from a smerc_cluster object, then the clusters function can be used. The function returns a list: each element of the list is a vector with the indices of the regions included in the cluster. The first element of the list contains the regions comprising the most likely cluster, the second element contains the regions for the second most likely cluster, etc.

clusters(scan_out)
## [[1]]
##  [1] 52 50 49 48 15 16  1 51 37 38 47  2 14 39 53 13 34 40  3 44 17 43 12 46
## 
## [[2]]
##  [1]  88  86  87  89  92  91  85  93  84  90 259

If a polygon-like structure is associated with each region, then the color.clusters can be used to easily produce nicer plots. E.g., we can use the nysf object to create a nicer map of our results.

plot(st_geometry(nysf), col = color.clusters(scan_out)) #nicer plot of scan.test results

Other scan methods available include:

Other methods

Other non-scan method for the detection of clusters and/or clustering of cases for regional data are provided.

bn.test performs the Besag-Newell test (Besag and Newell, 1991) and returns a smerc_cluster. The print, summary, and plot methods are available, as before.

bn_out = bn.test(coords = coords, cases = cases, pop = pop, cstar = 20,
                 alpha = 0.01) # perform besag-newell test
bn_out # print results
## method: Besag-Newell
## case radius: 20
## modified p-value: FALSE
summary(bn_out) # summarize results
##   nregions max_dist    cases   ex  rr stat           p
## 1        5      3.0 20.42516 10.2 2.0    5 0.004132413
## 2        5     13.4 20.15705 10.5 1.9    5 0.006017272
## 3        3      2.0 20.39443 10.8 1.9    3 0.008039295
## 4        5      2.7 25.45904 11.0 2.4    5 0.009113555
## 5        5      2.4 20.29863 11.1 1.9    5 0.009962904
plot(bn_out) # plot results

The Cluster Evaluation Permutation Procedure (CEPP) proposed by Turnbull et al. (1990) can be performed using cepp.test. The function returns a smerc_cluster object with print, summary, and plot functions.

# perform CEPP test
cepp_out = cepp.test(coords = coords, cases = cases, pop = pop,
                     nstar = 5000, nsim = 99, alpha = 0.1)
cepp_out # print results
## method: CEPP
## simulation: multinomial
## realizations: 99
## population radius: 5000
summary(cepp_out) # summarize results
##   nregions max_dist    cases  ex  rr stat    p
## 1        2      3.6 9.589796 2.8 3.5  9.6 0.09
plot(cepp_out) # plot results

Tango’s clustering detection test (Tango, 1995) can be performed via the tango.test function. The dweights function can be used to construct the weights for the test. The tango.test function produces and object of class tango, which has a native print function and a plot function that compares the goodness-of-fit and spatial autocorrelation components of Tango’s statistic for the observed data and the simulated data sets.

w = dweights(coords, kappa = 1) # construct weights matrix
tango_out = tango.test(cases, pop, w, nsim = 49) # perform tango's test
tango_out # print results
## method: Tango's index
## index: 0.0029
## goodness-of-fit component: 0.0026
## spatial autocorrelation component: 0.00033
## chi-square statistic: 210
## chi-square df: 110
## chi-square p-value: 1.4e-08
## Monte Carlo p-value: 0.02
plot(tango_out) # plot results

Additional details about cluster detection methods

Nearly all cluster detection methods have both a *.zones function that returns all the candidate zones for the method and a *.sim function that is used to produce results for data simulated under the null hypothesis. These are unlikely to interest most users, but may be useful for individuals wanting to better understand how these methods work or are interested in developing new methods based off existing ones. The *.sim functions are not meant for general use, as they are written to take very specific arguments that the user could easily misuse.

As an example, the following code produces all elliptical-shaped zones considered by the elliptic scan method (Kulldorff et al., 2006) using a population upperbound of no more than 10%. The function returns a list with the 248,213 candidate zones, as well as the associated shape and angle.

# obtain zones for elliptical scan method
ezones = elliptic.zones(coords, pop, ubpop = 0.1)
# view structure of ezones
str(ezones)
## List of 3
##  $ zones:List of 239228
##   ..$ : int 1
##   ..$ : int [1:2] 1 2
##   ..$ : int [1:3] 1 2 15
##   ..$ : int [1:4] 1 2 15 49
##   ..$ : int [1:5] 1 2 15 49 50
##   ..$ : int [1:6] 1 2 15 49 50 13
##   ..$ : int [1:7] 1 2 15 49 50 13 3
##   ..$ : int [1:8] 1 2 15 49 50 13 3 14
##   ..$ : int [1:9] 1 2 15 49 50 13 3 14 48
##   ..$ : int [1:10] 1 2 15 49 50 13 3 14 48 47
##   ..$ : int [1:11] 1 2 15 49 50 13 3 14 48 47 ...
##   ..$ : int [1:12] 1 2 15 49 50 13 3 14 48 47 ...
##   ..$ : int [1:13] 1 2 15 49 50 13 3 14 48 47 ...
##   ..$ : int [1:14] 1 2 15 49 50 13 3 14 48 47 ...
##   ..$ : int [1:15] 1 2 15 49 50 13 3 14 48 47 ...
##   ..$ : int [1:16] 1 2 15 49 50 13 3 14 48 47 ...
##   ..$ : int [1:17] 1 2 15 49 50 13 3 14 48 47 ...
##   ..$ : int [1:18] 1 2 15 49 50 13 3 14 48 47 ...
##   ..$ : int [1:19] 1 2 15 49 50 13 3 14 48 47 ...
##   ..$ : int [1:20] 1 2 15 49 50 13 3 14 48 47 ...
##   ..$ : int [1:21] 1 2 15 49 50 13 3 14 48 47 ...
##   ..$ : int [1:22] 1 2 15 49 50 13 3 14 48 47 ...
##   ..$ : int [1:23] 1 2 15 49 50 13 3 14 48 47 ...
##   ..$ : int [1:24] 1 2 15 49 50 13 3 14 48 47 ...
##   ..$ : int [1:25] 1 2 15 49 50 13 3 14 48 47 ...
##   ..$ : int [1:26] 1 2 15 49 50 13 3 14 48 47 ...
##   ..$ : int [1:27] 1 2 15 49 50 13 3 14 48 47 ...
##   ..$ : int [1:28] 1 2 15 49 50 13 3 14 48 47 ...
##   ..$ : int [1:29] 1 2 15 49 50 13 3 14 48 47 ...
##   ..$ : int 2
##   ..$ : int [1:3] 2 1 3
##   ..$ : int [1:4] 2 1 3 13
##   ..$ : int [1:5] 2 1 3 13 15
##   ..$ : int [1:6] 2 1 3 13 15 14
##   ..$ : int [1:7] 2 1 3 13 15 14 49
##   ..$ : int [1:8] 2 1 3 13 15 14 49 47
##   ..$ : int [1:9] 2 1 3 13 15 14 49 47 12
##   ..$ : int [1:10] 2 1 3 13 15 14 49 47 12 50
##   ..$ : int [1:11] 2 1 3 13 15 14 49 47 12 50 ...
##   ..$ : int [1:12] 2 1 3 13 15 14 49 47 12 50 ...
##   ..$ : int [1:13] 2 1 3 13 15 14 49 47 12 50 ...
##   ..$ : int [1:14] 2 1 3 13 15 14 49 47 12 50 ...
##   ..$ : int [1:15] 2 1 3 13 15 14 49 47 12 50 ...
##   ..$ : int [1:16] 2 1 3 13 15 14 49 47 12 50 ...
##   ..$ : int [1:17] 2 1 3 13 15 14 49 47 12 50 ...
##   ..$ : int [1:18] 2 1 3 13 15 14 49 47 12 50 ...
##   ..$ : int [1:20] 2 1 3 13 15 14 49 47 12 50 ...
##   ..$ : int [1:21] 2 1 3 13 15 14 49 47 12 50 ...
##   ..$ : int [1:22] 2 1 3 13 15 14 49 47 12 50 ...
##   ..$ : int [1:23] 2 1 3 13 15 14 49 47 12 50 ...
##   ..$ : int [1:24] 2 1 3 13 15 14 49 47 12 50 ...
##   ..$ : int [1:25] 2 1 3 13 15 14 49 47 12 50 ...
##   ..$ : int [1:27] 2 1 3 13 15 14 49 47 12 50 ...
##   ..$ : int [1:28] 2 1 3 13 15 14 49 47 12 50 ...
##   ..$ : int 3
##   ..$ : int [1:2] 3 13
##   ..$ : int [1:3] 3 13 2
##   ..$ : int [1:4] 3 13 2 12
##   ..$ : int [1:5] 3 13 2 12 14
##   ..$ : int [1:6] 3 13 2 12 14 5
##   ..$ : int [1:7] 3 13 2 12 14 5 1
##   ..$ : int [1:8] 3 13 2 12 14 5 1 10
##   ..$ : int [1:9] 3 13 2 12 14 5 1 10 11
##   ..$ : int [1:10] 3 13 2 12 14 5 1 10 11 4
##   ..$ : int [1:11] 3 13 2 12 14 5 1 10 11 4 ...
##   ..$ : int [1:12] 3 13 2 12 14 5 1 10 11 4 ...
##   ..$ : int [1:13] 3 13 2 12 14 5 1 10 11 4 ...
##   ..$ : int [1:14] 3 13 2 12 14 5 1 10 11 4 ...
##   ..$ : int [1:15] 3 13 2 12 14 5 1 10 11 4 ...
##   ..$ : int [1:16] 3 13 2 12 14 5 1 10 11 4 ...
##   ..$ : int [1:17] 3 13 2 12 14 5 1 10 11 4 ...
##   ..$ : int [1:18] 3 13 2 12 14 5 1 10 11 4 ...
##   ..$ : int [1:19] 3 13 2 12 14 5 1 10 11 4 ...
##   ..$ : int [1:20] 3 13 2 12 14 5 1 10 11 4 ...
##   ..$ : int [1:21] 3 13 2 12 14 5 1 10 11 4 ...
##   ..$ : int [1:22] 3 13 2 12 14 5 1 10 11 4 ...
##   ..$ : int [1:23] 3 13 2 12 14 5 1 10 11 4 ...
##   ..$ : int [1:24] 3 13 2 12 14 5 1 10 11 4 ...
##   ..$ : int [1:25] 3 13 2 12 14 5 1 10 11 4 ...
##   ..$ : int [1:26] 3 13 2 12 14 5 1 10 11 4 ...
##   ..$ : int [1:27] 3 13 2 12 14 5 1 10 11 4 ...
##   ..$ : int [1:29] 3 13 2 12 14 5 1 10 11 4 ...
##   ..$ : int 4
##   ..$ : int [1:2] 4 6
##   ..$ : int [1:3] 4 6 5
##   ..$ : int [1:4] 4 6 5 12
##   ..$ : int [1:5] 4 6 5 12 35
##   ..$ : int [1:6] 4 6 5 12 35 7
##   ..$ : int [1:7] 4 6 5 12 35 7 10
##   ..$ : int [1:8] 4 6 5 12 35 7 10 3
##   ..$ : int [1:9] 4 6 5 12 35 7 10 3 11
##   ..$ : int [1:10] 4 6 5 12 35 7 10 3 11 9
##   ..$ : int [1:11] 4 6 5 12 35 7 10 3 11 9 ...
##   ..$ : int [1:12] 4 6 5 12 35 7 10 3 11 9 ...
##   ..$ : int [1:13] 4 6 5 12 35 7 10 3 11 9 ...
##   ..$ : int [1:14] 4 6 5 12 35 7 10 3 11 9 ...
##   ..$ : int [1:15] 4 6 5 12 35 7 10 3 11 9 ...
##   ..$ : int [1:16] 4 6 5 12 35 7 10 3 11 9 ...
##   ..$ : int [1:17] 4 6 5 12 35 7 10 3 11 9 ...
##   .. [list output truncated]
##  $ shape: num [1:239228] 1 1 1 1 1 1 1 1 1 1 ...
##  $ angle: num [1:239228] 90 90 90 90 90 90 90 90 90 90 ...

References

Assuncao, R.M., Costa, M.A., Tavares, A. and Neto, S.J.F. (2006). Fast detection of arbitrarily shaped disease clusters, Statistics in Medicine, 25, 723-742.

Besag, J. and Newell, J. (1991). The detection of clusters in rare diseases, Journal of the Royal Statistical Society, Series A, 154, 327-333.

Costa, M.A. and Assuncao, R.M. and Kulldorff, M. (2012) Constrained spanning tree algorithms for irregularly-shaped spatial clustering, Computational Statistics & Data Analysis, 56(6), 1771-1783.

Kulldorff, M. (1997) A spatial scan statistic. Communications in Statistics - Theory and Methods, 26(6): 1481-1496,

Kulldorff, M., Huang, L., Pickle, L. and Duczmal, L. (2006) An elliptic spatial scan statistic. Statististics in Medicine, 25:3929-3943.

Neill, D. B. (2012), Fast subset scan for spatial pattern detection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 74: 337-360.

Patil, G.P. & Taillie, C. Upper level set scan statistic for detecting arbitrarily shaped hotspots. Environmental and Ecological Statistics (2004) 11(2):183-197.

Tango, T. (1995) A class of tests for detecting “general” and “focused” clustering of rare diseases. Statistics in Medicine. 14, 2323-2334.

Tango, T., & Takahashi, K. (2005). A flexibly shaped spatial scan statistic for detecting clusters. International journal of health geographics, 4(1), 11. Kulldorff, M. (1997) A spatial scan statistic. Communications in Statistics – Theory and Methods 26, 1481-1496.

Tango, T. and Takahashi, K. (2012), A flexible spatial scan statistic with a restricted likelihood ratio for detecting disease clusters. Statist. Med., 31: 4207-4218.

Turnbull, Bruce W., Iwano, Eric J., Burnett, William S., Howe, Holly L., Clark, Larry C. (1990). Monitoring for Clusters of Disease: Application to Leukemia Incidence in Upstate New York, American Journal of Epidemiology, 132(supp1):136-143.

Waller, L.A. and Gotway, C.A. (2005). Applied Spatial Statistics for Public Health Data. Hoboken, NJ: Wiley.

Yao, Z., Tang, J., & Zhan, F. B. (2011). Detection of arbitrarily-shaped clusters using a neighbor-expanding approach: A case study on murine typhus in South Texas. International journal of health geographics, 10(1), 1.