---
title: "Use Case 04: Estimation of optimal cluster size for a trial with pre-determined buffer width"
output:
  rmarkdown::html_vignette:
  toc: true
vignette: >
  %\VignetteIndexEntry{Use Case 04: Estimation of optimal cluster size for a trial with pre-determined buffer width}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

Both the number and size of clusters affect power calculations for CRTs:

  + If there are no logistical constraints, and spillover can be neglected (as in trials of vaccines that enrol only small proportions of the population), there is no need for a buffer zone and the most efficient design is an individually randomized CRT (i.e. a cluster size of one). In general, a trial with many small clusters has more power than one with the
same number of individuals enrolled in larger clusters.
  + If spillover is an issue, and it is decided to address this by including buffer zones, then the number of individuals included in the trial is less than the total population. Enumeration and intervention allocation are still required for the full trial area, so there can be substantial resource implications if many people are included in the buffers. There is a trade-off between increasing power by creating many small clusters (leading to a large proportion of locations in buffer zones) and reducing the proportion of locations in
buffer zones by using large clusters.

The `CRTspat` package provides functions for analysing this trade-off for any site for which baseline data are available.
The example shown here uses the baseline prevalence data introduced in [Use Case 1](Usecase1.html). The trial is assumed to plan to be based on the same outcome of prevalence, and to be powered for an efficacy of 30%. A set of different algorithmic cluster allocations are carried out with different numbers of clusters. Each allocation is randomized and buffer zones are specified with the a pre-specified width (in this example, 0.5 km). The ICC is computed from the baseline data, excluding the buffer zones, and corresponding power calculations are carried out. The power is calculated and plotted as a function of cluster size.


``` r
# use the same dataset as for Use Case 1.
library(CRTspat)
example_locations <- readdata('example_site.csv')
example_locations$base_denom <- 1
exampleCRT <- CRTsp(example_locations)
example <- aggregateCRT(exampleCRT,
      auxiliaries = c("RDT_test_result", "base_denom"))
# randomly sample an array of numbers of clusters to allocate
set.seed(5)
c_vec <- round(runif(20, min = 6, max = 60))

CRTscenario <- function(c, CRT, buffer_width) {
  ex <- specify_clusters(CRT, c = c, algo = "kmeans") %>%
    randomizeCRT() %>%
    specify_buffer(buffer_width = buffer_width)
    GEEanalysis <- CRTanalysis(ex, method = "GEE", baselineOnly = TRUE, excludeBuffer = TRUE,
                               baselineNumerator = "RDT_test_result", baselineDenominator = "base_denom")
  locations <- GEEanalysis$description$locations
  ex_power <- CRTpower(trial = ex, effect = 0.3, yC = GEEanalysis$pt_ests$controlY,
                    outcome_type = "p", N = GEEanalysis$description$sum.denominators/locations, c = c,
                    ICC = GEEanalysis$pt_ests$ICC)
  value <- c(c_full = c, c_core = ex_power$geom_core$c, clustersRequired = ex_power$geom_full$clustersRequired,
             power = ex_power$geom_full$power, mean_h = ex_power$geom_full$mean_h,
             locations = locations, ICC = GEEanalysis$pt_ests$ICC)
  names(value) <- c("c_full", "c_core", "clustersRequired", "power", "mean_h", "locations", "ICC")
  return(value)
}

results <- t(sapply(c_vec, FUN = CRTscenario, simplify = "array", CRT = example,
    buffer_width = 0.5)) %>%
    data.frame()
```

```
## (Intercept) 
##  -0.9370939
```

```
## (Intercept) 
##  -0.8589939
```

```
## (Intercept) 
##   -0.935533
```

```
## (Intercept) 
##  -0.8964881
```

```
## (Intercept) 
##   -1.023863
```

```
## (Intercept) 
##  -0.9778199
```

```
## (Intercept) 
##   -1.103676
```

```
## (Intercept) 
##   -1.003547
```

```
## (Intercept) 
##  -0.9010233
```

```
## (Intercept) 
##  -0.9686587
```

```
## (Intercept) 
##   -1.221709
```

```
## (Intercept) 
##   -1.118359
```

```
## (Intercept) 
##  -0.9804311
```

```
## (Intercept) 
##   -1.210983
```

```
## (Intercept) 
##   -1.140902
```

```
## (Intercept) 
##   -1.077208
```

```
## (Intercept) 
##   -1.252144
```

```
## (Intercept) 
##  -0.9988478
```

```
## (Intercept) 
##    -1.17822
```

```
## (Intercept) 
##  -0.9236708
```
Each simulated cluster allocation is different, as are the randomizations.  This leads to variation in the locations of the buffer zones, so the number of core clusters is a stochastic function of the number of clusters randomised (c). There is also variation in the estimated Intracluster Correlation (see [Use Case 3](Usecase3.html)) for any value of c.


``` r
total_locations <- example$geom_full$locations
results$proportion_included <- results$c_core * results$mean_h * 2/total_locations
results$corelocations_required <- results$clustersRequired * results$mean_h
results$totallocations_required <- with(results, total_locations/locations *
    corelocations_required)

library(ggplot2)
theme_set(theme_bw(base_size = 14))
ggplot(data = results, aes(x = c_full, y = c_core)) + geom_smooth() + xlab("Clusters allocated (per arm)") +
    ylab("Clusters in core (per arm)") + geom_segment(aes(x = 5, xend = 35,
    y = 18.5, yend = 18.5), arrow = arrow(length = unit(1, "cm")), lwd = 2,
    color = "red")
```
<p>
    <img src="example4b.r-1.png" alt="" > <br>
    <em>Fig 4.1 Numbers of clusters</em>
</p>

The number of clusters in the core area increases with the number of
clusters allocated, until the cluster size becomes small enough for
entire clusters to be swallowed by the buffer zones. This can be
illustrated by the contrast in the core areas randomised with c = 6
and c = 40 (Figures 4.2 and 4.3).


``` r
set.seed(7)
library(dplyr)
example6 <- specify_clusters(example, c = 6, algo = "kmeans") %>%
    randomizeCRT() %>%
    specify_buffer(buffer_width = 0.5)
plotCRT(example6, map = TRUE, showClusterBoundaries = TRUE, showClusterLabels = TRUE,
    labelsize = 2, maskbuffer = 0.2)
example40 <- specify_clusters(example, c = 40, algo = "kmeans") %>%
    randomizeCRT() %>%
    specify_buffer(buffer_width = 0.5)
plotCRT(example40, map = TRUE, showClusterBoundaries = TRUE, showClusterLabels = TRUE,
    labelsize = 2, maskbuffer = 0.2)
```
<p>
    <img src="example4c.r-1.png" alt=""> <br>
    <em>Fig 4.2 Map of clusters with c = 6</em>
</p>
<p>
    <img src="example4c.r-2.png" alt=""> <br>
    <em>Fig 4.3 Map of clusters with c = 40</em>
</p>

Beyond this point, increasing the number of clusters allocated in the fixed area (by
making them smaller) does not add to the total number of clusters. In this example the maximum is achieved when the
input c is about 35 and the output c is 18.5.


``` r
ggplot(data = results, aes(x = c_core, y = mean_h)) + geom_smooth() + xlab("Clusters in core (per arm)") +
    ylab("Mean cluster size")
```
<p>
    <img src="example4d.r-1.png" alt="" > <br>
    <em>Fig 4.4 Size of clusters</em>
</p>

The size of clusters decreases with the number allocated (Figure 4.4), but does
not fall much below 10 locations on average in the example because
smaller clusters are likely to be absorbed into the buffer zones.


``` r
ggplot(data = results, aes(x = c_core, y = power)) + geom_smooth() + xlab("Clusters in core (per arm)") +
    ylab("Power")
```
<p>
    <img src="example4e.r-1.png" alt="" > <br>
    <em>Fig 4.5 Power achievable with given site</em>
</p>

The power increases approximately linearly with the number of
clusters in the core (Figure 4.5), but the site is too small for an adequate power
to be achieved with this size of buffer, irrespective of the cluster
size.  Because the buffering leads to a maximum in the cluster
density (number of clusters per unit area), so does the power
achievable with a fixed area (Figure 4.6).


``` r
ggplot2::ggplot(data = results, aes(x = c_full, y = power)) + geom_smooth() +
    xlab("Clusters allocated (per arm)") + ylab("Power")
```
<p>
    <img src="example4f.r-1.png" alt="" > <br>
    <em>Fig 4.6 Power achievable with given site</em>
</p>

However the analysis also gives an estimate of how large an extended site is
needed to achieve adequate power (assuming the the spatial pattern
for the wider site to be similar to that of the baseline area). A minimum
total number of locations required to achieve a pre-specified power
(80%) is achieved at the same density of clusters as the maximum of
the power estimated for the smaller, baseline site.


``` r
ggplot2::ggplot(data = results, aes(x = c_core, y = corelocations_required)) +
    geom_smooth() + xlab("Clusters in core (per arm)") + ylab("Required core locations")
```

```
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
```
<p>
    <img src="example4g.r-1.png" alt="" > <br>
    <em>Fig 4.7 Number of clusters required for full trial area </em>
</p>

This is also at the allocation density where saturation is achieved
in the number of core clusters (Figure 4.1), and where the proportion of
the locations included in the core area reaches its minimum (Figure 4.8).


``` r
ggplot2::ggplot(data = results, aes(x = c_core, y = proportion_included)) +
    geom_smooth() + xlab("Clusters in core (per arm)") + ylab("Proportion of locations in core") +
    geom_segment(aes(x = 18, xend = 18, y = 0, yend = 0.25), arrow = arrow(length = unit(1,
        "cm")), lwd = 2, color = "red")
```
<p>
    <img src="example4h.r-1.png" alt="" > <br>
    <em>Fig 4.8 Proportions of locations in core</em>
</p>

#### Conclusions
With the example geography and the selected trial outcome, the most efficient trial design, conditional on a buffer width of 0.5 km, would be achieved by assigning about 30 clusters to each arm in a site of the size analysed, though about one third of these clusters would eliminated
by inclusion in the buffer zones, so that there would be . This would be far from adequate to achieve
adequate power. To achieve 80% power about 8,000 locations would be needed, in a larger trial area, of which about 2,400
would be in the core (sampled) parts of the clusters.


``` r
ggplot2::ggplot(data = results, aes(x = c_core, y = totallocations_required)) +
    geom_smooth() + xlab("Clusters in core (per arm)") + ylab("Total locations required") +
    geom_segment(aes(x = 18, xend = 18, y = 0, yend = 8000), arrow = arrow(length = unit(1,
        "cm")), lwd = 2, color = "red")
```
<p>
    <img src="example4i.r-1.png" alt="" > <br>
    <em>Fig 4.9 Size of trial area required to achieve adequate power</em>
</p>