mrddGlobal: Global Testing for Multivariate Regression Discontinuity Designs

This vignette explains how to use the mrddGlobal package. I will first describe the basics of global testing in multivariate regression discontinuity designs, and then show how the package automates these global testing tasks. For more details, see Samiahulin (2026).

Global Testing Basics

To describe the idea behind global testing, I will first state the general problem, and then describe the solution implemented by the mrddGlobal package.

Global Testing Setup - Treatment Effect Heterogeneity

Let \(Y_i\) denote the outcome variable for individual \(i\) with potential outcomes \(Y_i(1), Y_i(0)\), and let \(X_i\) be a vector of running variables. Also, let \(\Omega\) denote the region of treated observations with boundary \(\partial \Omega\), with the treatment indicator \(T_i\) defined as follows: \[ T_i = \begin{cases} 1, \quad X_i \in \Omega \\ 0, \quad X_i \notin \Omega \end{cases} \] If \(\tau(x) = \mathbb{E}[Y_i(1) - Y_i(0) | X_i = x]\), the primary objective is to test the following hypothesis: \[\begin{align*} H_0 &: \tau(x) = 0 \text{ for all } x \in \partial \Omega \\ H_a &: \tau(x) \neq 0 \text{ for some }x \in \partial \Omega \end{align*}\] One common approach to testing this hypothesis is to estimate \(\tau(x)\) along a discretized version of \(\partial \Omega\) and apply uniform confidence bands for inference. This works well in large samples, where each boundary point typically has many neighboring observations, making estimation and inference reliable. In small samples, however, boundary points may have few neighboring observations, and the asymptotic approximations become unreliable.

To address this, the mrddGlobal package uses machine learning and distance-based aggregation to pool boundary points together for reliable global testing. More specifically, the estimation and inference procedure takes the following steps:

  1. Split the data into \(K\) disjoint folds. Let \(I_k\) denote the set of observations in fold \(k\) (the test set), with the observations in the remaining folds (the training set) denoted \(I_k^c\).
  2. Let \(\gamma(X_i)\) be the closest boundary point to \(X_i\). Using the training set, estimate \(\tau(\cdot)\) at \(\gamma(X_i)\) for all \(i \in I_k\), using the local linear forest estimator of Friedberg et al. (2020). Define \(\Gamma_i\) as follows:

\[ \Gamma_i = \begin{cases} 1, \quad \tau(\gamma(X_i)) \geq 0 \\ 0, \quad \tau(\gamma(X_i)) < 0 \end{cases} \]

  3. Using the estimates from the previous step, estimate the following via a local linear regression (where \(g_i\) is the signed distance between \(X_i\) and \(\gamma(X_i)\)):

\[ \tau_\Gamma = \lim_{\epsilon \to 0^+} \mathbb{E}[(2\Gamma_i - 1) Y_i | g_i = \epsilon] - \lim_{\epsilon \to 0^-} \mathbb{E}[(2\Gamma_i - 1) Y_i | g_i = \epsilon] \]

  4. For inference, estimate \(\tau_\Gamma\) using a local quadratic regression and apply the multiplier bootstrap of Chiang, Hsu, and Sasaki (2019) to test \(H_0: \tau_\Gamma = 0\) against \(H_a: \tau_\Gamma > 0\).
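To make Steps 2 and 3 concrete, here is a minimal sketch, assuming the cross-fitted boundary estimates \(\hat{\tau}(\gamma(X_i))\) (tau_hat) and the signed distances (g) have already been computed; rdrobust stands in for the local linear step here and is not necessarily the package's internal implementation.

library(rdrobust)

# Sign indicator from the cross-fitted boundary estimates (Step 2)
Gamma <- as.numeric(tau_hat >= 0)

# Flip outcomes where the estimated effect is negative, then estimate the
# discontinuity in the flipped outcome at g = 0 via a local linear RD (Step 3)
Y_flip <- (2 * Gamma - 1) * Y
fit <- rdrobust(y = Y_flip, x = g, c = 0)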

If \(g(X_i)\) represents the signed distance between \(X_i\) and \(\gamma(X_i)\), \(\tau_\Gamma\) can also be rewritten as follows:

\[ \tau_\Gamma = \int_{x:g(x) = 0} |\tau(x)| \left( \frac{f_X(x)}{\int_{x: g(x) = 0} f_X(x) dS_\Omega} \right) dS_\Omega \]

where \(S_\Omega\) is a Hausdorff measure over \(\partial \Omega\). Intuitively, this univariate parameter represents a density-weighted average of \(|\tau(x)|\) over the boundary of the treated region. This means that if \(\tau_\Gamma > 0\), then \(\tau(x) \neq 0\) for some \(x \in \partial \Omega\). Conversely, if \(\tau_\Gamma = 0\), then \(\tau(x) = 0\) for all \(x \in \partial \Omega\). The global testing procedure above is implemented via the cef_disc_test function in the mrddGlobal package.

Global Testing Implementation via mrddGlobal

To install the mrddGlobal package, run the following code:

install.packages("mrddGlobal")

To load the package, use the following:

library(mrddGlobal)

To start using the package, the following are required for all tests:

  1. A matrix of running variables (X).
  2. A vector of treatment indicators (t).
  3. A matrix containing the closest boundary point for each observation (discussed in the next section).

For the boundary treatment effect heterogeneity test, a vector of outcomes (Y) will also be needed. For the density test, information about the support of your data will be required (more details on this will be given later in this document).

Closest Boundary Point

In all tests, the user will need to find the closest boundary point for each observation’s vector of running variables. Here, I will discuss three ways one can go about constructing this matrix of closest boundary points.

Option 1: Use an analytical solution

If the boundary of the treated region \(\partial \Omega\) can be easily expressed mathematically, then one can find the closest boundary points by solving the following, where \(\gamma_i^*\) is the closest boundary point to \(X_i\) and \(||\cdot||_2\) is the \(L_2\) norm:

\[ \gamma_i^* = \operatorname*{argmin}_{\gamma \in \partial \Omega} ||X_i - \gamma||_2 \]

For instance, in many multivariate RDD settings, the boundary can be expressed as the set of points where \(\min\{X_1, X_2, ..., X_d\} = 0\). This happens in settings where treatment is assigned to observations whose running variables all exceed certain thresholds (after centering each variable at its threshold). Some examples include students passing multiple exams to get into a program or municipalities passing certain pre-defined thresholds to be eligible for funding. In these cases, the \(j^{\text{th}}\) coordinate of \(\gamma_i^*\) is given as follows:

\[ \gamma_{ij}^* = \begin{cases} 0, \quad X_{ik} > 0 \;\forall k, \quad X_{ij} = \min_k\{X_{ik}\} \\ X_{ij}, \quad X_{ik} > 0 \;\forall k, \quad X_{ij} \neq \min_k\{X_{ik}\} \\ \max\{X_{ij}, 0\}, \quad \exists k \text{ s.t. } X_{ik} < 0 \end{cases} \]
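For illustration, a direct implementation of this formula might look like the following (closest_min_boundary is a hypothetical helper, written for an arbitrary number of running variables):

# Sketch: closest boundary points for the min boundary in any dimension
closest_min_boundary <- function(X) {
  t(apply(X, 1, function(x) {
    if (all(x > 0)) {
      # Treated point: project onto the nearest face by zeroing
      # out the smallest coordinate
      x[which.min(x)] <- 0
      x
    } else {
      # Some coordinate is negative: clamp negative coordinates to zero
      pmax(x, 0)
    }
  }))
}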

Option 2: Exact geometry solution

In some cases, finding the closest boundary point analytically can be challenging. This is especially true when the boundary of the treated region is geographic and complex. In such cases, a shape file can be used to represent the treated region and find the boundary points. For instance, here is a shape file for North Carolina with some randomly generated observations:

library(sf)
library(ggplot2)

# Loading in shape file data
nc <- st_read(system.file("shape/nc.shp", package = "sf"))
#> Reading layer `nc' from data source 
#>   `C:\Users\User\AppData\Local\R\win-library\4.4\sf\shape\nc.shp' 
#>   using driver `ESRI Shapefile'
#> Simple feature collection with 100 features and 14 fields
#> Geometry type: MULTIPOLYGON
#> Dimension:     XY
#> Bounding box:  xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
#> Geodetic CRS:  NAD27
nc <- st_union(nc)

# Generating points in and out of NC
bbox <- st_bbox(nc)
n <- 2000
pts <- data.frame(
  x = runif(n, bbox["xmin"], bbox["xmax"]),
  y = runif(n, bbox["ymin"], bbox["ymax"])
)
knitr::kable(head(pts))
|         x|        y|
|---------:|--------:|
| -82.61074| 34.53275|
| -79.80422| 34.15976|
| -78.94235| 33.93203|
| -79.60517| 34.25315|
| -78.00949| 34.87519|
| -83.58188| 34.44275|

Here is a plot of the points around North Carolina:

pts_sf <- st_as_sf(pts, coords = c("x", "y"), crs = st_crs(nc))
inside <- st_within(pts_sf, nc, sparse = FALSE)[,1]
pts_sf$status <- ifelse(inside, "Inside NC", "Outside NC")

ggplot() +
  geom_sf(data = nc, fill = NA, color = "black", linewidth = 0.6) +
  geom_sf(data = pts_sf, aes(color = status), alpha = 0.6, size = 1) +
  coord_sf() +
  scale_color_manual(values = c("Inside NC" = "blue", "Outside NC" = "red")) +
  theme_minimal() +
  labs(
    title = "Random Points Inside and Outside North Carolina",
    color = ""
  )

From here, we can use the get_closest_points_shp function to easily find the closest boundary point using our shape file and running variable matrix.

closest_points_all <- get_closest_points_shp(X = pts, shp_file = nc)
knitr::kable(head(closest_points_all))
|         X|        Y|
|---------:|--------:|
| -82.69736| 35.09123|
| -79.34456| 34.53283|
| -78.78505| 34.05906|
| -79.30294| 34.49778|
| -77.60973| 34.43504|
| -83.58110| 34.98957|

Here is a plot of the closest boundary points:

closest_points_geo <- st_as_sf(as.data.frame(closest_points_all), coords = c("X", "Y"), crs = st_crs(nc))
ggplot() +
  geom_sf(data = nc, fill = NA, color = "black", linewidth = 0.6) +
  geom_sf(data = closest_points_geo, alpha = 0.6, size = 1, color = "purple") +
  coord_sf() +
  theme_minimal() +
  labs(
    title = "Closest Boundary Points for North Carolina Observations",
    color = ""
  )

Option 3: Point cloud approach

When the boundary of the treated region is complex and difficult to work with directly, a third option for (approximately) finding the closest boundary point is to discretize the boundary into a set of points (also known as a point cloud) and match each observation to its nearest neighbor in the discretized boundary. For instance, consider the boundary defined by the following equation with three running variables:

\[ \sum_{j=1}^3 x_j^2 + 0.5\sin\left( \sum_{j=1}^3 x_j^4 \right) = 1 \]

Here is a plot of this boundary:

This boundary is not easy to use for an analytical solution because of the sine term. On top of that, this boundary is also difficult to express as a shape file because of its curvature and dimension. So, to approximate the boundary, I will create a discretized version of it using the make_cloud function (note that this function may take a few minutes to run):

# First, I will create a simulated dataframe
set.seed(1)
n <- 2000
df <- matrix(runif(n * 3, -1.2, 1.2), ncol = 3)

# Then I will define the function that determines the boundary
bndry_func <- function(X) {
  rowSums(X^2) + 0.5 * sin(rowSums(X^4)) - 1
}

# Finally, I will get just the point cloud from the make_cloud function
cloud <- make_cloud(X = df, bndry_func = bndry_func, num_pts_discr = 2e4, step_size = 1e-1, tol = 1e-2,
                    verbose = FALSE)
knitr::kable(head(cloud))
0.7668274 0.3885158 -0.2625383
0.8199770 0.2366764 -0.2318826
0.8035395 0.2787459 -0.2719072
0.7505699 0.3527048 -0.3782530
0.6898971 0.4623688 -0.4030039
0.7390081 0.4449633 -0.3068749

The make_cloud function discretizes the boundary as follows. Starting at the point start_pt, a new candidate point is generated by perturbing the current point’s coordinates with normal noise with standard deviation step_size. If the candidate is within tol of the actual boundary (as determined by bndry_func), it is saved as a boundary point. If the candidate is closer to the boundary than the current point, the search moves to the candidate; if it is farther, the candidate is discarded. This process is repeated until num_pts_discr boundary points are found.
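For intuition, here is a minimal sketch of this random-walk search (a hypothetical re-implementation; the actual make_cloud function adds safeguards and vectorization, and is the one to use in practice):

# Sketch of the random-walk boundary search described above
# (may run for a long time if tol is very strict)
make_cloud_sketch <- function(bndry_func, start_pt, num_pts_discr,
                              step_size = 0.1, tol = 0.01) {
  d <- length(start_pt)
  cloud <- matrix(NA_real_, nrow = num_pts_discr, ncol = d)
  current <- start_pt
  found <- 0
  while (found < num_pts_discr) {
    # Perturb the current point with normal noise
    candidate <- current + rnorm(d, sd = step_size)
    dist_cand <- abs(bndry_func(matrix(candidate, nrow = 1)))
    # Save the candidate if it is within tol of the boundary
    if (dist_cand < tol) {
      found <- found + 1
      cloud[found, ] <- candidate
    }
    # Move the search to the candidate if it is closer to the boundary
    if (dist_cand < abs(bndry_func(matrix(current, nrow = 1)))) {
      current <- candidate
    }
  }
  cloud
}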

If step_size is chosen too small, the algorithm implemented in make_cloud may terminate before covering the entire boundary. Visualizing the resulting point cloud provides a simple diagnostic to verify that all boundary regions are adequately represented. From here, to get the closest boundary points, we can use the get_closest_points_bbox function.

treated <- ifelse(bndry_func(df) <= 0, 1, 0)
df_bndry <- get_closest_points_bbox(cloud = cloud, X = df, t = treated)
closest_points_all <- df_bndry[["closest_points"]]
knitr::kable(head(closest_points_all))
-0.4284417 0.6468775 -0.5179465
-0.2523814 0.8377598 0.0119770
0.1452603 0.5962210 -0.6817302
0.8574082 -0.1300664 0.0200559
-0.4488960 -0.5057970 0.6435655
0.6264579 -0.6487247 -0.1714592

Boundary Treatment Effect Heterogeneity Test

To implement the global treatment effect heterogeneity test, we’ll first need to set up our data. Here is an example using two running variables, where the boundary is defined by \(\min\{x_1, x_2\} = 0\). In this case, observations with \(x_1, x_2 \geq 0\) will be treated, while the others are not. Also, the outcome will be defined as follows:

\[ Y = \frac{1}{3} \begin{cases} X_1 + X_2, \quad X_1, X_2 \in [0, 1] \\ 1 - X_1, \quad X_1 \in [0, 1], X_2 \in [-1, 0] \\ 1 - X_2, \quad X_1 \in [-1, 0], X_2 \in [0, 1] \\ 1, \quad X_1, X_2 \in [-1, 0] \end{cases} + \epsilon \]

where \(\epsilon \sim N(0, \sigma_\epsilon^2)\), \(X_1, X_2\) take values in \([-1, 1]\), and \(f_X(x) = 0.25\) with \(X_1\) independent of \(X_2\). For the boundary heterogeneity test, the target parameter is given as follows:

\[ \tau_\Gamma = \int_{x:g(x) = 0} |\tau(x)| \left( \frac{f_X(x)}{\int_{x: g(x) = 0} f_X(x) dS_\Omega} \right) dS_\Omega \]

For the data generated above, \(\tau_\Gamma = 1/6\). The code for generating this data is below:

# Closest boundary points are found analytically since the boundary is simple
closest_boundary_point <- function(df) {
  # Separating individual coordinates and initializing output matrix
  x1 <- df[, 1]
  x2 <- df[, 2]
  result <- matrix(0, nrow = nrow(df), ncol = 2)
  
  # Case 1: both coordinates are negative
    # Closest boundary point is the origin
  mask1 <- x1 < 0 & x2 < 0
  result[mask1, ] <- cbind(0, 0)
  
  # Case 2: first coordinate is nonnegative and at least as large as the second
    # Closest boundary point is the vertical projection onto the x axis
  mask2 <- x1 >= 0 & x1 >= x2 & !mask1
  result[mask2, ] <- cbind(x1[mask2], 0)
  
  # Case 3: second coordinate is nonnegative and larger than the first
    # Closest boundary point is the horizontal projection onto the y axis
  mask3 <- x2 >= 0 & x2 > x1 & !mask1
  result[mask3, ] <- cbind(0, x2[mask3])
  
  return(result)
}

sim_data <- gen_data_cef(1, n = 1000, seed = 1)
X <- as.matrix(sim_data[,1:2])
Y <- as.matrix(sim_data[,3])
t <- ifelse(X[,1] >= 0 & X[,2] >= 0, 1, 0)
closest_points_all <- closest_boundary_point(X)

Now that the necessary data is prepared, the treatment effect heterogeneity test can be performed using the cef_disc_test function from the mrddGlobal package:

results <- cef_disc_test(X, Y, t, closest_points_all, verbose = FALSE, num.trees = 5000, enable.ll.split = TRUE, bwselect = "diff")
results
#> Global Treatment Effect Heterogeneity Test Results
#> ----------------------------------
#> Main Estimate            :  0.2083
#> Bias-Corrected Estimate  :  0.1830
#> Standard Error           :  0.0849
#> P-Value                  :  0.0164

The first number (tau) gives the estimate of \(\tau_\Gamma\), which is close to the expected value of 1/6. The second number (tau_bc) gives the bias-corrected estimate used for inference. The third number (se_tau) gives the standard error of the bias-corrected estimate, while the fourth number (pval) gives the p-value for the treatment effect heterogeneity test. More specifically, tau estimates a density-weighted average of the absolute value of the treatment effects along the boundary between the treated and control regions. The p-value tells us that, at the 5% level, we can reject the null hypothesis that treatment effects are zero along the entire boundary.

By default, the cef_disc_test function only performs one split in the K-fold cross-fitting procedure. To perform more splits and improve the reliability of the hypothesis test, more splits can be added using the num_splits argument.

results <- cef_disc_test(X, Y, t, closest_points_all, num_splits = 10, verbose = FALSE, num.trees = 5000, enable.ll.split = TRUE, bwselect = "diff")
results
#> Global Treatment Effect Heterogeneity Test Results
#> ----------------------------------
#> Main Estimate            :  0.1917
#> Bias-Corrected Estimate  :  0.1971
#> Standard Error           :  0.0641
#> P-Value                  :  0.0008

To see some of the split- and fold-level bandwidths and estimates, we can use the summary method. The full bandwidth and estimate matrices can be extracted from the output of cef_disc_test:

summary(results)
#> Global Treatment Effect Heterogeneity Test Results
#> ----------------------------------
#> Main Estimate            :  0.1917
#> Bias-Corrected Estimate  :  0.1971
#> Standard Error           :  0.0641
#> P-Value                  :  0.0008
#> 
#> Split- and fold-level results:
#>                 Bandwidths (Left) Bandwidths (Right) Main Estimates
#> Split 1, Fold 1           0.37523            0.37523        0.25214
#> Split 1, Fold 2           0.94176            0.94176        0.16439
#> Split 2, Fold 1           0.77495            0.77495        0.21673
#> Split 2, Fold 2           0.58620            0.58620        0.11791
#> Split 3, Fold 1           0.48849            0.48849        0.27545
#> Split 3, Fold 2           0.58963            0.58963        0.14853
#> Split 4, Fold 1           0.91257            0.91257        0.24796
#> Split 4, Fold 2           0.37659            0.37659        0.20019
#> Split 5, Fold 1           0.48951            0.48951        0.22015
#> Split 5, Fold 2           0.62055            0.62055        0.13772
#>                 Bias-Corrected Estimates
#> Split 1, Fold 1                   0.2323
#> Split 1, Fold 2                   0.1336
#> Split 2, Fold 1                   0.2057
#> Split 2, Fold 2                   0.0858
#> Split 3, Fold 1                   0.2522
#> Split 3, Fold 2                   0.1508
#> Split 4, Fold 1                   0.2134
#> Split 4, Fold 2                   0.3039
#> Split 5, Fold 1                   0.2366
#> Split 5, Fold 2                   0.1461
#> 
#> [showing first 10 of 20 rows]

When adding more splits, neither the main nor the bias-corrected estimate changed much. However, adding more splits reduced the standard error and p-value associated with the hypothesis test; in other words, the main test statistic became more stable as the number of splits increased. Although the efficiency gains can be large at first, these improvements do not continue past a certain number of splits, as shown below:

results <- cef_disc_test(X, Y, t, closest_points_all, num_splits = 15, verbose = FALSE, num.trees = 5000, enable.ll.split = TRUE, bwselect = "diff")
results
#> Global Treatment Effect Heterogeneity Test Results
#> ----------------------------------
#> Main Estimate            :  0.1966
#> Bias-Corrected Estimate  :  0.1993
#> Standard Error           :  0.0650
#> P-Value                  :  0.0006

Multivariate Density Test Basics

Before testing for treatment effects along \(\partial \Omega\), it may be interesting to assess the plausibility of running variable manipulation. To do this, we can check whether the joint density of the running variables is discontinuous along \(\partial \Omega\). More formally, define the following:

\[\begin{align*} f_X^+(x) & = \lim_{x' \rightarrow x,\, g(x') \geq 0} f_X(x'), \quad f_X^-(x) = \lim_{x' \rightarrow x,\, g(x') < 0} f_X(x') \\ \tau_f(x) & = f_X^+(x) - f_X^-(x) \end{align*}\]

In this case, the hypothesis of interest is as follows:

\[\begin{align*} H_0 &: \tau_f(x) = 0 \text{ for all } x \in \partial \Omega \\ H_a &: \tau_f(x) \neq 0 \text{ for some }x \in \partial \Omega \end{align*}\]

To test this hypothesis, let \(\Lambda_i\) be defined as follows:

\[ \Lambda_i = \begin{cases} 1, \quad \tau_f(\gamma(X_i)) \geq 0 \\ 0, \quad \tau_f(\gamma(X_i)) < 0 \end{cases} \]

Using a K-fold cross-fitting procedure similar to the one described above, we can test the hypothesis above by testing for a discontinuity in the density of \(g_i^* = (2\Lambda_i - 1) g_i\) at zero. The size of this discontinuity is equivalent to the following:

\[ \int_{x: g(x) = 0} |\tau_f(x)| dS_\Omega \]

To estimate the discontinuity in the density of \(g_i^*\), we can plug in an estimated version of \(g_i^*\) (denoted \(\hat{g}_i\)) and use the following boundary bias-corrected kernel density estimator:

\[ \hat{f}_{g^*}(0)^+ - \hat{f}_{g^*}(0)^- = \sum_{i=1}^n \left( \frac{1}{nh_k} \right) \left(\alpha_1 + \alpha_2 \left| \frac{\hat{g}_{i}}{h_k} \right| \right) K\left( \frac{\hat{g}_{i}}{h_k} \right) (2\delta_i-1) \]

where \(\delta_i = 1\) if \(\hat{g}_i \geq 0\) (zero otherwise), \(h_k\) is the bandwidth used for observations in fold \(k\), \(K(\cdot)\) is a kernel function (Epanechnikov kernel for this package), and \(\alpha_1, \alpha_2\) solve the following system of equations:

\[ \begin{pmatrix} \int_0^1 K(u) du & \int_0^1 u K(u) du \\ \int_0^1 u K(u) du & \int_0^1 u^2 K(u) du \end{pmatrix} \begin{pmatrix} \alpha_1 \\ \alpha_2 \end{pmatrix} = \begin{pmatrix} 1 \\ 0 \end{pmatrix} \]

The coefficients \(\alpha_1, \alpha_2\) remove the boundary bias and first-order bias, allowing the corrected kernel to behave like a standard kernel at an interior point. This procedure is handled automatically by the density_disc_test function.
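For intuition, the correction coefficients and the resulting estimator can be sketched as follows (a minimal re-implementation, with the transformed distances \(\hat{g}_i\) (g_hat) and the bandwidth h assumed given; density_disc_test handles all of this, including bandwidth selection, internally):

K <- function(u) 0.75 * (1 - u^2) * (abs(u) <= 1)  # Epanechnikov kernel

# Kernel moments over [0, 1] and the boundary-correction coefficients
m0 <- integrate(function(u) K(u), 0, 1)$value        # 1/2
m1 <- integrate(function(u) u * K(u), 0, 1)$value    # 3/16
m2 <- integrate(function(u) u^2 * K(u), 0, 1)$value  # 1/10
alpha <- solve(matrix(c(m0, m1, m1, m2), 2, 2), c(1, 0))
# alpha is approximately (6.74, -12.63)

# Discontinuity estimate given transformed distances g_hat and bandwidth h
density_jump <- function(g_hat, h) {
  u <- g_hat / h
  w <- (alpha[1] + alpha[2] * abs(u)) * K(u)
  mean(w * (2 * (g_hat >= 0) - 1)) / h
}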

Estimating Joint Density Discontinuities

For the multivariate density test, a variation of the random forest density estimator (RFDE) of Wen and Hang (2022) is used to estimate \(\tau_f(\cdot)\). The full algorithm goes as follows for either the treated or control region:

  1. Define a bounding box
    Construct the smallest axis-aligned rectangle that encloses the region of interest.

  2. Choose a split direction
    Select one coordinate axis uniformly at random.

  3. Split the region
    Along the selected axis, split the current rectangle at its midpoint. Alternatively, one may draw two points independently from a uniform distribution over the rectangle’s extent along the chosen axis and split at their average (experimental option).

  4. Grow the tree recursively
    Repeat Steps 2–3 on each resulting sub-rectangle until a user-specified maximum tree depth is reached.

  5. Estimate support coverage
    For each terminal rectangle (leaf), estimate the fraction of its volume that lies within the support of the data. This fraction is approximated by sampling points uniformly within the rectangle and computing the proportion that fall inside the support.

  6. Estimate leaf densities
    Estimate the density within each leaf by dividing the proportion of observations assigned to that leaf by the volume of the leaf.

  7. Aggregate across trees
    Repeat Steps 1–6 independently \(T\) times to obtain \(T\) tree-based density estimators, and average these estimates to form the final random forest density estimate.
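For intuition, here is a minimal sketch of Steps 1 through 4, i.e., one tree's recursive midpoint partition (a hypothetical re-implementation, not the package's internal code):

# Sketch: recursively split a bounding box at midpoints along random axes
split_box <- function(lower, upper, depth) {
  if (depth == 0) return(list(cbind(lower, upper)))
  j <- sample(length(lower), 1)        # Step 2: random split direction
  mid <- (lower[j] + upper[j]) / 2     # Step 3: midpoint split
  upper_left <- upper; upper_left[j] <- mid
  lower_right <- lower; lower_right[j] <- mid
  c(split_box(lower, upper_left, depth - 1),   # Step 4: recurse
    split_box(lower_right, upper, depth - 1))
}

# Leaves of one depth-3 tree over the unit square bounding box (Step 1)
leaves <- split_box(c(0, 0), c(1, 1), depth = 3)
length(leaves)  # 8 terminal rectangles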

As an example, suppose the treated region is defined as follows: \[ \Omega = \{x_1, x_2 \in \mathbb{R} : (x_1 - 0.5)^2 + (x_2 - 0.5)^2 \leq 0.5^2\} \]

To estimate the density of the running variables at the boundary of this treated region, we start by defining a bounding box (which in this case is \([0, 1]^2\)). Next, we split this box recursively until some desired depth is reached. Below is an example where the maximum depth is 3.

For the red evaluation point, the density is estimated as the fraction of observations in its leaf divided by the leaf’s effective volume. Since only part of the support intersects the leaf, we can approximate this volume by multiplying the area of the rectangular leaf by the fraction of the support within the cell. This fraction is computed by randomly generating points uniformly within the leaf and calculating the proportion that fall inside the support. For instance, the sketch below works through this calculation for the top-left cell.
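This is a minimal sketch of the Monte Carlo volume calculation; the leaf limits below are illustrative rather than taken from an actual fitted tree.

# Circular treated region from the example above
in_support <- function(x) (x[1] - 0.5)^2 + (x[2] - 0.5)^2 <= 0.5^2

# Hypothetical top-left leaf: [0, 0.25] x [0.75, 1]
set.seed(1)
samples <- cbind(runif(1e4, 0, 0.25), runif(1e4, 0.75, 1))
frac_inside <- mean(apply(samples, 1, in_support))

# Effective leaf volume = geometric area x fraction inside the support
effective_volume <- (0.25 * 0.25) * frac_inside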

Support Data

To run the density test, two pieces of information are required:

  1. Bounding box limits for the splitting algorithm.
  2. Functions that indicate whether an observation is treated or not (for the support volume calculations).

There are three ways to derive the bounding box limits for the treated/control regions. The first is analytical: when the bounds of the treated/control regions are easy to derive from the geometry of the given regions, they can be written down directly. For instance, for the treated region defined by \(\Omega = \{x_1, x_2 \in \mathbb{R} : (x_1 - 0.5)^2 + (x_2 - 0.5)^2 \leq 0.5^2\}\), the bounding box is easily derived as \([0, 1]^2\). When the treatment boundary is more complex or unknown, bounding box limits may be difficult to derive. If shape files are available for the treated/control regions, the st_bbox function can be used to get bounding box limits. The st_bbox function takes a geometry and produces a list of limits for the bounding box around that geometry. For instance, if the treated region is North Carolina, then the bounding box for this region can be found using the following code:

nc <- st_read(system.file("shape/nc.shp", package = "sf"))
#> Reading layer `nc' from data source 
#>   `C:\Users\User\AppData\Local\R\win-library\4.4\sf\shape\nc.shp' 
#>   using driver `ESRI Shapefile'
#> Simple feature collection with 100 features and 14 fields
#> Geometry type: MULTIPOLYGON
#> Dimension:     XY
#> Bounding box:  xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
#> Geodetic CRS:  NAD27
nc <- st_union(nc)
bbox <- st_bbox(nc)
bbox
#>      xmin      ymin      xmax      ymax 
#> -84.32385  33.88199 -75.45698  36.58965

When the boundary of interest cannot be expressed in terms of a shape file, then the get_closest_points_bbox function (along with a boundary point cloud) can be used to define bounding box limits that can be passed directly into the density_disc_test function. For example, suppose the treated region is defined as: \[ \Omega = \{x_1, x_2 \in \mathbb{R} : (x_1 - 0.5)^2 + (x_2 - 0.5)^2 \leq 0.5^2\} \]

and the control region is defined as:

\[ [0, 1]^2 \setminus \Omega \]

Further, we will assume that the bounds of the control region are unknown. To start, we first need to make a point cloud for the boundary of \(\Omega\).

set.seed(123)
n <- 1000
x1 <- runif(n)
x2 <- runif(n)
x <- cbind(x1, x2)

bndry_func <- function(X) {
  return((X[,1] - 0.5)^2 + (X[,2] - 0.5)^2 - 0.5^2)
}

cloud <- make_cloud(X = x, bndry_func = bndry_func, start_pt = c(1, 0), num_pts_discr = 1e4, tol = 1e-2,
                    verbose = FALSE)
plot(cloud, xlab = "x1", ylab = "x2", main = "Circle Boundary Point Cloud")

Now, I will use the get_closest_points_bbox function to get the bounding boxes around the treated and control regions. Since the limits of the control region may not be known by the user, the get_closest_points_bbox function will use the sample and point cloud to approximate the bounding box limits.

inside <- ifelse((x[,1] - 0.5)^2 + (x[,2] - 0.5)^2 <= 0.5^2, 1, 0)
df_bndry <- get_closest_points_bbox(cloud, x, inside)

lower_bounds0 <- df_bndry[["lower_bounds0"]]
lower_bounds1 <- df_bndry[["lower_bounds1"]]
upper_bounds0 <- df_bndry[["upper_bounds0"]]
upper_bounds1 <- df_bndry[["upper_bounds1"]]

cat("lower_bounds0:\n")
#> lower_bounds0:
print(lower_bounds0)
#>           x1           x2 
#> -0.009690445 -0.009328973

cat("\nlower_bounds1:\n")
#> 
#> lower_bounds1:
print(lower_bounds1)
#>           x1           x2 
#> -0.009690445 -0.009328973

cat("\nupper_bounds0:\n")
#> 
#> upper_bounds0:
print(upper_bounds0)
#>       x1       x2 
#> 1.009727 1.009749

cat("\nupper_bounds1:\n")
#> 
#> upper_bounds1:
print(upper_bounds1)
#>       x1       x2 
#> 1.009727 1.009749

The lists lower_bounds0 and lower_bounds1 give the lower bounds for each coordinate of the bounding box around the control and treated regions, respectively. Similarly, upper_bounds0 and upper_bounds1 give the corresponding upper bounds. In this example, the bounding boxes for both the treated and control regions are \([0, 1]^2\), so the upper bounds should be close to one and the lower bounds should be close to zero.

Now that the bounding boxes for the treated/control regions are defined, we need to define functions that determine whether a point lies in the treated or control region. These functions should take a single vector as input and return either TRUE or FALSE.

in_region1 <- function(X) {
  (X[1] - 0.5)^2 + (X[2] - 0.5)^2 <= 0.5^2
}

in_region0 <- function(X) {
  (X[1] - 0.5)^2 + (X[2] - 0.5)^2 > 0.5^2
}

Importantly, a point that is not classified as treated is not necessarily classified as control, and vice versa; for example, a point outside \([0, 1]^2\) lies in neither region here. Consequently, in_region1 is not always the complement of in_region0. With that, we are now ready to run the global density test.

Density Manipulation Testing with mrddGlobal

With all of the necessary components ready, we can now conduct a density test using the density_disc_test function. Here is an example using two running variables, where the boundary is defined by \(\min\{x_1, x_2\} = 0\). In this case, observations with \(x_1, x_2 \geq 0\) will be treated, while the others are not. Also, the density will be defined as follows:

\[ f_X(x) = \frac{1}{3} \begin{cases} X_1 + X_2, \quad X_1, X_2 \in [0, 1] \\ 1 - X_1, \quad X_1 \in [0, 1], X_2 \in [-1, 0] \\ 1 - X_2, \quad X_1 \in [-1, 0], X_2 \in [0, 1] \\ 1, \quad X_1, X_2 \in [-1, 0] \end{cases} \]

where \(X_1, X_2\) take values in \([-1, 1]\). Although this boundary is simple, let’s try using random sampling to find the closest boundary points and generate the necessary support data.

X <- gen_data_density(1, seed = 12345)
bndry_func <- function(X) {
  # To ensure that each running variable remains between -1 and 1, I will add a penalty 
  if (any(abs(X) > 1)) {
    return(999)
  } else {
    return(min(X))
  }
}

in_region1 <- function(x) {
  x[1] >= 0 && x[2] >= 0 && x[1] <= 1 && x[2] <= 1
}

in_region0 <- function(x) {
  !(x[1] >= 0 && x[2] >= 0) && abs(x[1]) <= 1 && abs(x[2]) <= 1
}

cloud <- make_cloud(X, bndry_func = bndry_func, num_pts_discr = 2e4, step_size = 1e-1, tol = 1e-3,
                    verbose = FALSE)
plot(cloud, xlab = "x1", ylab = "x2", main = "Min Boundary Point Cloud")

Using this boundary data, let’s now run the global density test. Here, the main estimate is expected to be around 1/3. Note that when the discretized boundary is used in the density_disc_test function, setting the option rand_mid_split = TRUE seems to produce more reliable results.

inside <- ifelse(X[,1] >= 0 & X[,2] >= 0, 1, 0)
df_bndry <- get_closest_points_bbox(cloud, X, inside)

results_gen <- density_disc_test(X, inside, df_bndry[["closest_points"]], in_region0,
                                 in_region1, df_bndry[["lower_bounds0"]], 
                                 df_bndry[["lower_bounds1"]], df_bndry[["upper_bounds0"]],
                                 df_bndry[["upper_bounds1"]], verbose = FALSE, 
                                 rand_mid_split = TRUE, num_splits = 1, mc_samples = 500)

results_gen
#> Global Density Test Results
#> ----------------------------------
#> Main Estimate            :  0.3155
#> Bias-Corrected Estimate  :  0.2627
#> Standard Error           :  0.1195
#> P-Value                  :  0.0132

Without the experimental splitting algorithm, and with known closest boundary points and bounding box limits (reusing the closest_boundary_point function defined earlier), we get the following:

closest_points <- closest_boundary_point(X)
t_R <- system.time(
  results_gen <- density_disc_test(X, inside, closest_points, in_region0,
                                 in_region1, c(-1, -1), 
                                 c(0, 0), c(1, 1),
                                 c(1, 1), verbose = FALSE, num_splits = 1, mc_samples = 500)
)
results_gen
#> Global Density Test Results
#> ----------------------------------
#> Main Estimate            :  0.3325
#> Bias-Corrected Estimate  :  0.3277
#> Standard Error           :  0.1160
#> P-Value                  :  0.0024

The results here tell us that the integral over the absolute value of discontinuities in the density along the boundary is close to the expected 1/3. The p-value indicates that, at a 5% level, we can reject the null hypothesis that there are no discontinuities in the density of the running variables along the boundary.

For a boost in speed, the in_region0 and in_region1 functions can also be written in C++. Here is a modified version of the code above with C++ functions instead of R functions.

library(Rcpp)

cppFunction(
  'SEXP region1_to_xptr() {
      typedef bool (*region_ptr)(const NumericVector&);
      return XPtr<region_ptr>(new region_ptr(in_region1_cpp), true);
  }', 
  includes = '
  // C++ version of in_region1
  bool in_region1_cpp(const NumericVector& x) {
      return (x[0] >= 0.0 && x[1] >= 0.0 &&
              x[0] <= 1.0 && x[1] <= 1.0);
  }'
)

cppFunction(
  'SEXP region0_to_xptr() {
      typedef bool (*region_ptr)(const NumericVector&);
      return XPtr<region_ptr>(new region_ptr(in_region0_cpp), true);
  }', 
  includes = '
  // C++ version of in_region0
  bool in_region0_cpp(const NumericVector& x) {
      bool not_in_first_quadrant = !(x[0] >= 0.0 && x[1] >= 0.0);
      bool within_box = (std::abs(x[0]) <= 1.0 && std::abs(x[1]) <= 1.0);
      return not_in_first_quadrant && within_box;
  }'
)

ptr1 <- region1_to_xptr()
ptr0 <- region0_to_xptr()

t_cpp <- system.time(
  results_gen <- density_disc_test(X, inside, closest_points, ptr0,
                                 ptr1, c(-1, -1), 
                                 c(0, 0), c(1, 1),
                                 c(1, 1), verbose = FALSE, num_splits = 1, mc_samples = 500)
)
cat("R Runtime:\n")
#> R Runtime:
print(t_R)
#>    user  system elapsed 
#>   32.47    0.11   32.56
cat("\n")
cat("C++ Runtime:\n")
#> C++ Runtime:
print(t_cpp)
#>    user  system elapsed 
#>    1.80    0.02    1.81

References

Chiang, Harold D, Yu-Chin Hsu, and Yuya Sasaki. 2019. “Robust Uniform Inference for Quantile Treatment Effects in Regression Discontinuity Designs.” Journal of Econometrics 211 (2): 589–618.
Friedberg, Rina, Julie Tibshirani, Susan Athey, and Stefan Wager. 2020. “Local Linear Forests.” Journal of Computational and Graphical Statistics 30 (2): 503–17.
Samiahulin, Artem. 2026. “Global Testing in Multivariate Regression Discontinuity Designs.” arXiv Preprint arXiv:2602.03819. https://arxiv.org/abs/2602.03819.
Wen, Hongwei, and Hanyuan Hang. 2022. “Random Forest Density Estimation.” In International Conference on Machine Learning, 23701–22. PMLR.