mrddGlobal: Global Testing for Multivariate Regression Discontinuity Designs

This vignette aims to explain how to use the mrddGlobal
package. I will first describe the basics of global testing in
multivariate regression discontinuity designs, and then describe how the
package can automatically handle global testing tasks. For more details,
see Samiahulin (2026).
To describe the idea behind global testing, I will first state the
general problem, and then describe the solution implemented by the
mrddGlobal package.
Let \(Y_i\) denote the outcome variable for individual \(i\) with potential outcomes \(Y_i(1), Y_i(0)\) and let \(X_i\) be a vector of running variables. Also, let \(\Omega\) denote the region of treated observations with boundary \(\partial \Omega\) and treatment indicator \(T_i\) being defined as follows: \[ T_i = \begin{cases} 1, \quad X_i \in \Omega \\ 0, \quad X_i \notin \Omega \end{cases} \] If \(\tau(x) = \mathbb{E}[Y_i(1) - Y_i(0) | X_i = x]\), the primary objective is to test the following hypothesis: \[\begin{align*} H_0 &: \tau(x) = 0 \text{ for all } x \in \partial \Omega \\ H_a &: \tau(x) \neq 0 \text{ for some }x \in \partial \Omega \end{align*}\] One common approach to testing this hypothesis is to estimate \(\tau(x)\) along a discretized version of \(\partial \Omega\) and apply uniform bands for inference. While this works well for large samples, this approach often fails for small samples. When the sample size is large, each boundary point typically has many neighboring observations, which makes estimation and inference reliable. With a small overall sample size, boundary points may not have many neighboring observations, making asymptotic approximations unreliable.
With that in mind, the approach used in the mrddGlobal package
uses machine learning and distance-based aggregation to pool points
together for reliable global testing. More specifically, let \(\gamma(X_i)\) denote the closest point on \(\partial \Omega\) to \(X_i\), let \(g_i = g(X_i)\) denote the signed distance between \(X_i\) and \(\gamma(X_i)\), and define the sign indicator \(\Gamma_i\) as follows:
\[ \Gamma_i = \begin{cases} 1, \quad \tau(\gamma(X_i)) \geq 0 \\ 0, \quad \tau(\gamma(X_i)) < 0 \end{cases} \]
The target parameter for the global test is then the discontinuity
\[ \tau_\Gamma = \lim_{\epsilon \to 0^+} \mathbb{E}[(2\Gamma_i - 1) Y_i | g_i = \epsilon] - \lim_{\epsilon \to 0^-} \mathbb{E}[(2\Gamma_i - 1) Y_i | g_i = \epsilon] \]
Since \(g_i = 0\) exactly when \(X_i \in \partial \Omega\), \(\tau_\Gamma\) can also be rewritten as follows:
\[ \tau_\Gamma = \int_{x:g(x) = 0} |\tau(x)| \left( \frac{f_X(x)}{\int_{x: g(x) = 0} f_X(x) dS_\Omega} \right) dS_\Omega \]
where \(S_\Omega\) is a Hausdorff
measure over \(\partial \Omega\).
Intuitively, the univariate parameter \(\tau_\Gamma\) represents a
density-weighted average of \(|\tau(x)|\) over the boundary of the
treated region. This means that if \(\tau_\Gamma > 0\), then \(\tau(x) \neq 0\) for some \(x \in \partial \Omega\), while if \(\tau_\Gamma = 0\) (and \(f_X\) is positive along \(\partial \Omega\)), then \(\tau(x) = 0\) for all \(x \in \partial \Omega\). The global testing
procedure above is implemented via the cef_disc_test
function in the mrddGlobal package.
To install the mrddGlobal package, run the following code:
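The installation command is not shown in the extracted vignette; assuming the package is distributed on CRAN, installation would look like this (if it is instead hosted on GitHub, remotes::install_github() with the appropriate repository would be used):

```r
# Assumption: mrddGlobal is available on CRAN
install.packages("mrddGlobal")
```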
To load the package, use the following:
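```r
# Attach the package for the current session
library(mrddGlobal)
```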
To start using the package, the following are required for all tests:

- A matrix of running variables (X), where rows represent observations and columns represent running variables.
- A vector of treatment indicators (t).
- A matrix of closest boundary points (closest_points_all).

For the boundary treatment effect heterogeneity test, a vector of
outcomes (Y) will also be needed. For the density test,
information about the support of your data will be required (more
details on this will be given later in this document).
In all tests, the user will need to find the closest boundary point for each observation’s vector of running variables. Here, I will discuss three ways one can go about constructing this matrix of closest boundary points.
If the boundary of the treated region \(\partial \Omega\) can be easily expressed mathematically, then one can find the closest boundary points by solving the following, where \(\gamma_i^*\) is the closest boundary point to \(X_i\) and \(||\cdot||_2\) is the \(L_2\) norm:
\[ \gamma_i^* = \operatorname*{argmin}_{\gamma \in \partial \Omega} ||X_i - \gamma||_2 \]
For instance, in many multivariate RDD settings, the boundary can be expressed by the set of points where \(\min\{X_1, X_2, ..., X_d\} = 0\). This happens in settings where treatment is assigned to observations where all running variables are higher than a certain threshold. Some examples include students passing multiple exams to get into a program or municipalities passing certain pre-defined thresholds to be eligible for funding. In these cases, the \(j^{\text{th}}\) coordinate of \(\gamma_i^*\) is given as follows:
\[ \gamma_{ij}^* = \begin{cases} 0, \quad X_{ij} > 0 \quad \forall j, \quad X_{ij} = \min_j\{X_{ij}\} \\ X_{ij}, \quad X_{ij} > 0 \quad \forall j, \quad X_{ij} \neq \min_j\{X_{ij}\} \\ \max\{X_{ij}, 0\}, \quad \exists k \text{ s.t. } X_{ik} < 0 \end{cases} \]
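As a sketch (not a package function; the name closest_point_min_boundary is illustrative), the three cases above can be implemented for an arbitrary number of running variables:

```r
# Closest boundary point for the boundary {x : min_j x_j = 0},
# implementing the three cases above row by row.
closest_point_min_boundary <- function(X) {
  t(apply(X, 1, function(x) {
    if (all(x > 0)) {
      # Treated observation: zero out the smallest coordinate
      x[which.min(x)] <- 0
      x
    } else {
      # Control observation: project every negative coordinate up to zero
      pmax(x, 0)
    }
  }))
}
```

For example, closest_point_min_boundary(rbind(c(-1, 2, 3))) returns the boundary point (0, 2, 3).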
In some cases, finding the closest boundary point analytically can be challenging. This is especially true in cases where the boundary of the treated region is geographic and complex. In this case, a shape file can be used to represent the treated region and find the boundary points. For instance, here is a shape file for North Carolina with some randomly-generated observations:
library(sf)
library(ggplot2)
# Loading in shape file data
nc <- st_read(system.file("shape/nc.shp", package = "sf"))
#> Reading layer `nc' from data source
#> `C:\Users\User\AppData\Local\R\win-library\4.4\sf\shape\nc.shp'
#> using driver `ESRI Shapefile'
#> Simple feature collection with 100 features and 14 fields
#> Geometry type: MULTIPOLYGON
#> Dimension: XY
#> Bounding box: xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
#> Geodetic CRS: NAD27
nc <- st_union(nc)
# Generating points in and out of NC
bbox <- st_bbox(nc)
n <- 2000
pts <- data.frame(
x = runif(n, bbox["xmin"], bbox["xmax"]),
y = runif(n, bbox["ymin"], bbox["ymax"])
)
knitr::kable(head(pts))

| x | y |
|---|---|
| -82.61074 | 34.53275 |
| -79.80422 | 34.15976 |
| -78.94235 | 33.93203 |
| -79.60517 | 34.25315 |
| -78.00949 | 34.87519 |
| -83.58188 | 34.44275 |
Here is a plot of the points around North Carolina:
pts_sf <- st_as_sf(pts, coords = c("x", "y"), crs = st_crs(nc))
inside <- st_within(pts_sf, nc, sparse = FALSE)[,1]
pts_sf$status <- ifelse(inside, "Inside NC", "Outside NC")
ggplot() +
geom_sf(data = nc, fill = NA, color = "black", linewidth = 0.6) +
geom_sf(data = pts_sf, aes(color = status), alpha = 0.6, size = 1) +
coord_sf() +
scale_color_manual(values = c("Inside NC" = "blue", "Outside NC" = "red")) +
theme_minimal() +
labs(
title = "Random Points Inside and Outside North Carolina",
color = ""
)

From here, we can use the get_closest_points_shp
function to easily find the closest boundary point using our shape file
and running variable matrix.
closest_points_all <- get_closest_points_shp(X = pts, shp_file = nc)
knitr::kable(head(closest_points_all))

| X | Y |
|---|---|
| -82.69736 | 35.09123 |
| -79.34456 | 34.53283 |
| -78.78505 | 34.05906 |
| -79.30294 | 34.49778 |
| -77.60973 | 34.43504 |
| -83.58110 | 34.98957 |
Here is a plot of the closest boundary points:
closest_points_geo <- st_as_sf(as.data.frame(closest_points_all), coords = c("X", "Y"), crs = 4326)
ggplot() +
geom_sf(data = nc, fill = NA, color = "black", linewidth = 0.6) +
geom_sf(data = closest_points_geo, alpha = 0.6, size = 1, color = "purple") +
coord_sf() +
theme_minimal() +
labs(
title = "Closest Boundary Points for North Carolina Observations",
color = ""
)

When the boundary of the treated region is complex and not easy to work with, a third option for (approximately) finding the closest boundary point is to discretize the boundary into a set of points (also known as a point cloud) and find the nearest neighbor between each observation and the discretized boundary.
\[ \sum_{j=1}^3 x_j^2 + 0.5\sin\left( \sum_{j=1}^3 x_j^4 \right) = 1 \]
Here is a plot of this boundary:
This boundary is not easy to handle analytically because
of the \(\sin\) term. On top of that, this boundary is also difficult to
express as a shape file because of its curvature and dimension.
So, to approximate the boundary, I will create a discretized version of
the boundary using the make_cloud function (note that this
function may take a few minutes to run):
# First, I will create a simulated dataframe
set.seed(1)
n <- 2000
df <- matrix(runif(n * 3, -1.2, 1.2), ncol = 3)
# Then I will define the function that determines the boundary
bndry_func <- function(X) {
rowSums(X^2) + 0.5 * sin(rowSums(X^4)) - 1
}
# Finally, I will get just the point cloud from the make_cloud function
cloud <- make_cloud(X = df, bndry_func = bndry_func, num_pts_discr = 2e4, step_size = 1e-1, tol = 1e-2,
verbose = FALSE)
knitr::kable(head(cloud))

| 0.7668274 | 0.3885158 | -0.2625383 |
| 0.8199770 | 0.2366764 | -0.2318826 |
| 0.8035395 | 0.2787459 | -0.2719072 |
| 0.7505699 | 0.3527048 | -0.3782530 |
| 0.6898971 | 0.4623688 | -0.4030039 |
| 0.7390081 | 0.4449633 | -0.3068749 |
The make_cloud function discretizes the boundary by
starting at the point start_pt. A new candidate point is
then selected by perturbing the current point's coordinates with normal
noise with standard deviation step_size. If the candidate
is within tol distance of the actual boundary (as
determined by bndry_func), the candidate is saved as
a boundary point. If the candidate is closer to the boundary than the
current point, the search restarts from the candidate; otherwise, the
candidate is discarded. This process is repeated until
num_pts_discr boundary points are found.
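For intuition, the loop just described can be sketched as follows. This is a simplified stand-alone version, not the packaged implementation; in particular, it uses |bndry_func(x)| as a stand-in for the distance to the boundary:

```r
# Simplified sketch of the boundary point cloud search
make_cloud_sketch <- function(bndry_func, start_pt, num_pts_discr,
                              step_size = 0.1, tol = 1e-2, max_iter = 1e6) {
  # Proxy for distance to the boundary (assumption of this sketch)
  dist_to_bndry <- function(p) abs(bndry_func(matrix(p, nrow = 1)))
  current <- start_pt
  cloud <- matrix(NA_real_, nrow = num_pts_discr, ncol = length(start_pt))
  found <- 0
  for (i in seq_len(max_iter)) {
    if (found >= num_pts_discr) break
    # Perturb the current point with normal noise
    candidate <- current + rnorm(length(current), sd = step_size)
    # Save the candidate if it lies within tol of the boundary
    if (dist_to_bndry(candidate) < tol) {
      found <- found + 1
      cloud[found, ] <- candidate
    }
    # Restart the search at the candidate if it is closer to the boundary
    if (dist_to_bndry(candidate) < dist_to_bndry(current)) current <- candidate
  }
  cloud[seq_len(found), , drop = FALSE]
}
```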
If step_size is chosen too small, the algorithm
implemented in make_cloud may terminate before covering the
entire boundary. Visualizing the resulting point cloud provides a simple
diagnostic to verify that all boundary regions are adequately
represented. From here, to get the closest boundary points, we can use
the get_closest_points_bbox function.
treated <- ifelse(bndry_func(df) <= 0, 1, 0)
df_bndry <- get_closest_points_bbox(cloud = cloud, X = df, t = treated)
closest_points_all <- df_bndry[["closest_points"]]
knitr::kable(head(closest_points_all))

| -0.4284417 | 0.6468775 | -0.5179465 |
| -0.2523814 | 0.8377598 | 0.0119770 |
| 0.1452603 | 0.5962210 | -0.6817302 |
| 0.8574082 | -0.1300664 | 0.0200559 |
| -0.4488960 | -0.5057970 | 0.6435655 |
| 0.6264579 | -0.6487247 | -0.1714592 |
To implement the global treatment effect heterogeneity test, we’ll first need to set up our data. Here is an example using two running variables, where the boundary is defined by \(\min\{x_1, x_2\} = 0\). In this case, observations with \(x_1, x_2 \geq 0\) will be treated, while the others are not. Also, the outcome will be defined as follows:
\[ Y = \frac{1}{3} \begin{cases} X_1 + X_2, \quad X_1, X_2 \in [0, 1] \\ 1 - X_1, \quad X_1 \in [0, 1], X_2 \in [-1, 0] \\ 1 - X_2, \quad X_1 \in [-1, 0], X_2 \in [0, 1] \\ 1, \quad X_1, X_2 \in [-1, 0] \end{cases} + \epsilon \]
where \(\epsilon \sim N(0, \sigma_\epsilon^2)\), \(X_1, X_2\) take values in \([-1, 1]\), and \(f_X(x) = 0.25\) with \(X_1\) independent of \(X_2\). For the boundary heterogeneity test, the target parameter is given as follows:
\[ \tau_\Gamma = \int_{x:g(x) = 0} |\tau(x)| \left( \frac{f_X(x)}{\int_{x: g(x) = 0} f_X(x) dS_\Omega} \right) dS_\Omega \]
For the data generated above, \(\tau_\Gamma = 1/6\). The code for generating this data is below:
# Closest boundary points are found analytically since the boundary is simple
closest_boundary_point <- function(df) {
# Separating individual coordinates and initializing output matrix
x1 <- df[, 1]
x2 <- df[, 2]
result <- matrix(0, nrow = nrow(df), ncol = 2)
# Case 1: both coordinates are negative
# Closest boundary point is the origin
mask1 <- x1 < 0 & x2 < 0
result[mask1, ] <- cbind(0, 0)
# Case 2: first coordinate is nonnegative and at least as large as the second
# Closest boundary point is either directly below or above on the x axis
mask2 <- x1 >= 0 & x1 >= x2 & !mask1 # >= so that ties (x1 == x2 >= 0) are covered
result[mask2, ] <- cbind(x1[mask2], 0)
# Case 3: second coordinate is positive and larger than the first coordinate
# Closest boundary point is either directly left or right on the y axis
mask3 <- x2 >= 0 & x1 < x2 & !mask1
result[mask3, ] <- cbind(0, x2[mask3])
return(result)
}
sim_data <- gen_data_cef(1, n = 1000, seed = 1)
X <- as.matrix(sim_data[,1:2])
Y <- as.matrix(sim_data[,3])
t <- ifelse(X[,1] >= 0 & X[,2] >= 0, 1, 0)
closest_points_all <- closest_boundary_point(X)

Now that the necessary data is prepared, the treatment effect
heterogeneity test can be performed using the cef_disc_test
function from the mrddGlobal package:
results <- cef_disc_test(X, Y, t, closest_points_all, verbose = FALSE, num.trees = 5000, enable.ll.split = TRUE, bwselect = "diff")
results
#> Global Treatment Effect Heterogeneity Test Results
#> ----------------------------------
#> Main Estimate : 0.2083
#> Bias-Corrected Estimate : 0.1830
#> Standard Error : 0.0849
#> P-Value : 0.0164

The first number (tau) gives the estimate of \(\tau_\Gamma\), which is around the expected
1/6. The second number (tau_bc) gives the bias-corrected
estimate used for inference. The third number (se_tau)
gives the standard error for the bias-corrected estimate, while the
fourth number (pval) gives the p-value for the treatment
effect heterogeneity test. More specifically, tau estimates
a density-weighted average of the absolute value of the treatment
effects along the boundary between treated and control regions. The
p-value tells us that, at the 5% level, we can reject the null
hypothesis that treatment effects are zero along the entire
boundary.
By default, the cef_disc_test function only performs one
split in the K-fold cross-fitting procedure. To perform more splits and
improve the reliability of the hypothesis test, more splits can be added
using the num_splits argument.
results <- cef_disc_test(X, Y, t, closest_points_all, num_splits = 10, verbose = FALSE, num.trees = 5000, enable.ll.split = TRUE, bwselect = "diff")
results
#> Global Treatment Effect Heterogeneity Test Results
#> ----------------------------------
#> Main Estimate : 0.1917
#> Bias-Corrected Estimate : 0.1971
#> Standard Error : 0.0641
#> P-Value : 0.0008

To see some of the split- and fold-level bandwidths and estimates, we
can use the summary method. The full bandwidth and estimate matrices can
be extracted from the output of cef_disc_test:
summary(results)
#> Global Treatment Effect Heterogeneity Test Results
#> ----------------------------------
#> Main Estimate : 0.1917
#> Bias-Corrected Estimate : 0.1971
#> Standard Error : 0.0641
#> P-Value : 0.0008
#>
#> Split- and fold-level results:
#> Bandwidths (Left) Bandwidths (Right) Main Estimates
#> Split 1, Fold 1 0.37523 0.37523 0.25214
#> Split 1, Fold 2 0.94176 0.94176 0.16439
#> Split 2, Fold 1 0.77495 0.77495 0.21673
#> Split 2, Fold 2 0.58620 0.58620 0.11791
#> Split 3, Fold 1 0.48849 0.48849 0.27545
#> Split 3, Fold 2 0.58963 0.58963 0.14853
#> Split 4, Fold 1 0.91257 0.91257 0.24796
#> Split 4, Fold 2 0.37659 0.37659 0.20019
#> Split 5, Fold 1 0.48951 0.48951 0.22015
#> Split 5, Fold 2 0.62055 0.62055 0.13772
#> Bias-Corrected Estimates
#> Split 1, Fold 1 0.2323
#> Split 1, Fold 2 0.1336
#> Split 2, Fold 1 0.2057
#> Split 2, Fold 2 0.0858
#> Split 3, Fold 1 0.2522
#> Split 3, Fold 2 0.1508
#> Split 4, Fold 1 0.2134
#> Split 4, Fold 2 0.3039
#> Split 5, Fold 1 0.2366
#> Split 5, Fold 2 0.1461
#>
#> [showing first 10 of 20 rows]

When adding more splits, neither the main nor the bias-corrected estimate changed much. However, adding more splits reduced the standard error and p-value associated with the hypothesis test. In other words, the main test statistic became more stable as the number of splits increased. Although the efficiency gains can be large at first, these improvements do not continue past a certain number of splits, as shown below:
results <- cef_disc_test(X, Y, t, closest_points_all, num_splits = 15, verbose = FALSE, num.trees = 5000, enable.ll.split = TRUE, bwselect = "diff")
results
#> Global Treatment Effect Heterogeneity Test Results
#> ----------------------------------
#> Main Estimate : 0.1966
#> Bias-Corrected Estimate : 0.1993
#> Standard Error : 0.0650
#> P-Value : 0.0006

Before testing for treatment effects along \(\partial \Omega\), it may be interesting to assess the plausibility of running variable manipulation. To do this, we can check whether the joint density of the running variables is discontinuous along \(\partial \Omega\). More formally, define the following:
\[\begin{align*} f_X^+(x) & = \lim_{x' \rightarrow x, g_\Omega(x') \geq 0} f_X(x') , \quad f_X^-(x) = \lim_{x' \rightarrow x, g_\Omega(x') < 0} f_X(x') \\ \tau_f(x) & = f_X^+(x) - f_X^-(x) \end{align*}\]
In this case, the hypothesis of interest is as follows:
\[\begin{align*} H_0 &: \tau_f(x) = 0 \text{ for all } x \in \partial \Omega \\ H_a &: \tau_f(x) \neq 0 \text{ for some }x \in \partial \Omega \end{align*}\]
To test this hypothesis, let \(\Lambda_i\) be defined as follows:
\[ \Lambda_i = \begin{cases} 1, \quad \tau_f(X_i) \geq 0 \\ 0, \quad \tau_f(X_i) < 0 \end{cases} \]
Using a similar K-fold cross-fitting procedure from before, we can test the hypothesis above by testing for a discontinuity in the density of \(g_i^* = (2\Lambda_i - 1) g_i\) at zero. The size of the discontinuity is equivalent to the following:
\[ \int_{x: g(x) = 0} |\tau_f(x)| dS_\Omega \]
To estimate the discontinuity in the density of \(g_i^*\), we can plug in an estimated version of \(g_i^*\) (denoted \(\hat{g}_i\)) and use the following boundary bias-corrected kernel density estimator:
\[ \hat{f}_{g^*}(0)^+ - \hat{f}_{g^*}(0)^- = \sum_{i=1}^n \left( \frac{1}{nh_k} \right) \left(\alpha_1 + \alpha_2 \left| \frac{\hat{g}_{i}}{h_k} \right| \right) K\left( \frac{\hat{g}_{i}}{h_k} \right) (2\delta_i-1) \]
where \(\delta_i = 1\) if \(\hat{g}_i \geq 0\) (zero otherwise), \(h_k\) is the bandwidth used for observations in fold \(k\), \(K(\cdot)\) is a kernel function (Epanechnikov kernel for this package), and \(\alpha_1, \alpha_2\) solve the following system of equations:
\[ \begin{pmatrix} \int_0^1 K(u) du & \int_0^1 u K(u) du \\ \int_0^1 u K(u) du & \int_0^1 u^2 K(u) du \end{pmatrix} \begin{pmatrix} \alpha_1 \\ \alpha_2 \end{pmatrix} = \begin{pmatrix} 1 \\ 0 \end{pmatrix} \]
The coefficients \(\alpha_1,
\alpha_2\) remove boundary bias and first-order bias, allowing
the corrected kernel to behave like a standard kernel at interior
points. This procedure is automatically handled by the
density_disc_test function.
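As a quick numerical check of the system above (illustrative code, not part of the package), the partial moments of the Epanechnikov kernel \(K(u) = \frac{3}{4}(1 - u^2)\) can be computed and the system solved directly:

```r
# Partial moments of the Epanechnikov kernel on [0, 1]
K <- function(u) 0.75 * (1 - u^2)
m0 <- integrate(K, 0, 1)$value                      # = 1/2
m1 <- integrate(function(u) u * K(u), 0, 1)$value   # = 3/16
m2 <- integrate(function(u) u^2 * K(u), 0, 1)$value # = 1/10
# Solve the 2x2 moment system for (alpha_1, alpha_2)
alpha <- solve(matrix(c(m0, m1, m1, m2), nrow = 2), c(1, 0))
round(alpha, 4)  # approximately 6.7368 and -12.6316
```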
For the multivariate density test, a variation of the random forest density estimator (RFDE) of Wen and Hang (2022) is used to estimate \(\tau_f(\cdot)\). The full algorithm goes as follows for either the treated or control region:
1. Define a bounding box. Construct the smallest axis-aligned rectangle that encloses the region of interest.
2. Choose a split direction. Select one coordinate axis uniformly at random.
3. Split the region. Along the selected axis, split the current rectangle at its midpoint. Alternatively, one may draw two points independently from a uniform distribution over the rectangle’s extent along the chosen axis and split at their average (experimental option).
4. Grow the tree recursively. Repeat Steps 2–3 on each resulting sub-rectangle until a user-specified maximum tree depth is reached.
5. Estimate support coverage. For each terminal rectangle (leaf), estimate the fraction of its volume that lies within the support of the data. This fraction is approximated by sampling points uniformly within the rectangle and computing the proportion that fall inside the support.
6. Estimate leaf densities. Estimate the density within each leaf by dividing the proportion of observations assigned to that leaf by the volume of the leaf.
7. Aggregate across trees. Repeat Steps 1–6 independently \(T\) times to obtain \(T\) tree-based density estimators, and average these estimates to form the final random forest density estimate.
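To make the steps above concrete, here is a stripped-down single-tree sketch (illustrative code with hypothetical names such as density_tree; the package's estimator differs in its details). Averaging \(T\) such trees would give the forest estimate:

```r
# Single density tree: midpoint splits to a fixed depth, Monte Carlo
# support coverage in each leaf, density = obs share / effective volume
density_tree <- function(X, lower, upper, in_support, depth = 3, mc_samples = 200) {
  n <- nrow(X); d <- ncol(X)
  leaves <- list()
  grow <- function(lo, hi, rows, level) {
    if (level == depth) {
      # Fraction of the leaf's volume inside the support, by uniform sampling
      samp <- matrix(runif(mc_samples * d, rep(lo, each = mc_samples),
                           rep(hi, each = mc_samples)), ncol = d)
      frac <- mean(apply(samp, 1, in_support))
      vol <- prod(hi - lo) * max(frac, 1 / mc_samples)  # guard against zero volume
      leaves[[length(leaves) + 1]] <<- list(lo = lo, hi = hi,
                                            dens = (length(rows) / n) / vol)
      return(invisible(NULL))
    }
    # Pick a random axis and split the rectangle at its midpoint
    j <- sample.int(d, 1)
    mid <- (lo[j] + hi[j]) / 2
    left <- rows[X[rows, j] <= mid]
    hi_left <- hi; hi_left[j] <- mid
    lo_right <- lo; lo_right[j] <- mid
    grow(lo, hi_left, left, level + 1)
    grow(lo_right, hi, setdiff(rows, left), level + 1)
  }
  grow(lower, upper, seq_len(n), 0)
  # Return an evaluator over the stored leaves
  function(x) {
    for (lf in leaves) if (all(x >= lf$lo) && all(x <= lf$hi)) return(lf$dens)
    0
  }
}
```

Here, in_support plays the role of the in_region functions discussed later in this vignette, and mc_samples controls the Monte Carlo approximation of each leaf's effective volume.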
As an example, suppose the treated region is defined as follows: \[ \Omega = \{x_1, x_2 \in \mathbb{R} : (x_1 - 0.5)^2 + (x_2 - 0.5)^2 \leq 0.5^2\} \]
To estimate the density of the running variables at the boundary of this treated region, we start by defining a bounding box (which in this case is \([0, 1]^2\)). Next, we split this box recursively until some desired depth is reached. Below is an example where the maximum depth is 3.
For the red evaluation point, the density is estimated as the fraction of observations in its leaf divided by the leaf’s effective volume. Since only part of the support intersects the leaf, we can approximate this volume by multiplying the area of the rectangular leaf by the fraction of the support within the cell. This fraction is computed by randomly generating points uniformly within the leaf and calculating the proportion that fall inside the support. For instance, here is an example using the top left cell.
To do the density test, two pieces of information are required: bounding box limits for the treated and control regions, and functions that determine whether a given point lies in the treated or in the control region.

To get the bounding box limits, there are three ways of deriving them.
The first way is analytical, where the bounds of the
treated/control region are easy to derive from
where the bounds of the treated/control region are easy to derive from
the geometry of the given regions. For instance, for the treated region
defined by \(\Omega = \{x_1, x_2 \in
\mathbb{R} : (x_1 - 0.5)^2 + (x_2 - 0.5)^2 \leq 0.5^2\}\), the
bounding box can be easily derived as \([0,
1]^2\). When the treatment boundary is more complex or unknown,
then bounding box limits may be difficult to derive. If shape files are
available for the treated/control regions, then the function
st_bbox can be used to get bounding box limits. The
st_bbox function takes a geometry and produces a list of
limits for the bounding box around the geometry in the shape file. For
instance, if the treated region is North Carolina, then the bounding box
for this region can be found using the following code:
nc <- st_read(system.file("shape/nc.shp", package = "sf"))
#> Reading layer `nc' from data source
#> `C:\Users\User\AppData\Local\R\win-library\4.4\sf\shape\nc.shp'
#> using driver `ESRI Shapefile'
#> Simple feature collection with 100 features and 14 fields
#> Geometry type: MULTIPOLYGON
#> Dimension: XY
#> Bounding box: xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
#> Geodetic CRS: NAD27
nc <- st_union(nc)
bbox <- st_bbox(nc)
bbox
#> xmin ymin xmax ymax
#> -84.32385 33.88199 -75.45698 36.58965

When the boundary of interest cannot be expressed in terms of a shape
file, then the get_closest_points_bbox function (along with
a boundary point cloud) can be used to define bounding box limits that
can be passed directly into the density_disc_test function.
For example, suppose the treated region is defined as: \[
\Omega = \{x_1, x_2 \in \mathbb{R} : (x_1 - 0.5)^2 + (x_2 - 0.5)^2 \leq
0.5^2\}
\]
and the control region is defined as:
\[ [0, 1]^2 \setminus \Omega \]
Further, we will assume that the bounds of the control region are unknown. To start, we first need to make a point cloud for the boundary of \(\Omega\).
set.seed(123)
n <- 1000
x1 <- runif(n)
x2 <- runif(n)
x <- cbind(x1, x2)
bndry_func <- function(X) {
return((X[,1] - 0.5)^2 + (X[,2] - 0.5)^2 - 0.5^2)
}
cloud <- make_cloud(X = x, bndry_func = bndry_func, start_pt = c(1, 0), num_pts_discr = 1e4, tol = 1e-2,
verbose = FALSE)
plot(cloud, xlab = "x1", ylab = "x2", main = "Circle Boundary Point Cloud")

Now, I will use the get_closest_points_bbox function to
get the bounding boxes around the treated and control regions. Since the
limits of the control region may not be known by the user, the
get_closest_points_bbox function will use the sample and
point cloud to approximate the bounding box limits.
inside <- ifelse((x[,1] - 0.5)^2 + (x[,2] - 0.5)^2 <= 0.5^2, 1, 0)
df_bndry <- get_closest_points_bbox(cloud, x, inside)
lower_bounds0 <- df_bndry[["lower_bounds0"]]
lower_bounds1 <- df_bndry[["lower_bounds1"]]
upper_bounds0 <- df_bndry[["upper_bounds0"]]
upper_bounds1 <- df_bndry[["upper_bounds1"]]
cat("lower_bounds0:\n")
#> lower_bounds0:
print(lower_bounds0)
#> x1 x2
#> -0.009690445 -0.009328973
cat("\nlower_bounds1:\n")
#>
#> lower_bounds1:
print(lower_bounds1)
#> x1 x2
#> -0.009690445 -0.009328973
cat("\nupper_bounds0:\n")
#>
#> upper_bounds0:
print(upper_bounds0)
#> x1 x2
#> 1.009727 1.009749
cat("\nupper_bounds1:\n")
#>
#> upper_bounds1:
print(upper_bounds1)
#> x1 x2
#> 1.009727 1.009749

The lists lower_bounds0 and lower_bounds1
give the lower bounds for each coordinate of the bounding box around the
treated and control regions, respectively. Similarly,
upper_bounds0 and upper_bounds1 give the upper
bounds for each coordinate of the bounding box around the treated and
control regions, respectively. In this example, the bounding boxes for
both the treated and control regions are \([0,
1]^2\), so the upper bounds should be close to one and the lower
bounds should be close to zero.
Now that the bounding boxes for the treated/control regions are
defined, we will now need to define functions that determine whether an
observation is in either a treated or control region. These functions
should take a single vector as an input and return either
TRUE or FALSE as output.
in_region1 <- function(X) {
(X[1] - 0.5)^2 + (X[2] - 0.5)^2 <= 0.5^2
}
in_region0 <- function(X) {
(X[1] - 0.5)^2 + (X[2] - 0.5)^2 > 0.5^2
}

Importantly, a point that is not classified as treated is not
necessarily classified as control, and vice versa. Consequently,
in_region1 is not always the complement of
in_region0. With that, we are now ready to do the global
density test.
With all of the necessary components ready, we can now conduct a
density test using the density_disc_test function. Here is
an example using two running variables, where the boundary is defined by
\(\min\{x_1, x_2\} = 0\). In this case,
observations with \(x_1, x_2 \geq 0\)
will be treated, while the others are not. Also, the density will be
defined as follows:
\[ f_X(x) = \frac{1}{3} \begin{cases} X_1 + X_2, \quad X_1, X_2 \in [0, 1] \\ 1 - X_1, \quad X_1 \in [0, 1], X_2 \in [-1, 0] \\ 1 - X_2, \quad X_1 \in [-1, 0], X_2 \in [0, 1] \\ 1, \quad X_1, X_2 \in [-1, 0] \end{cases} \]
where \(X_1, X_2\) take values in \([-1, 1]\). Although this boundary is simple, let’s try using random sampling to find the closest boundary points and generate the necessary support data.
X <- gen_data_density(1, seed = 12345)
bndry_func <- function(X) {
# To ensure that each running variable remains between -1 and 1, I will add a penalty
if (any(abs(X) > 1)) {
return(999)
} else {
return(min(X))
}
}
in_region1 <- function(x) {
x[1] >= 0 && x[2] >= 0 && x[1] <= 1 && x[2] <= 1
}
in_region0 <- function(x) {
!(x[1] >= 0 && x[2] >= 0) && abs(x[1]) <= 1 && abs(x[2]) <= 1
}
cloud <- make_cloud(X, bndry_func = bndry_func, num_pts_discr = 2e4, step_size = 1e-1, tol = 1e-3,
verbose = FALSE)
plot(cloud, xlab = "x1", ylab = "x2", main = "Min Boundary Point Cloud")

Using this boundary data, let’s now run the global density test.
Here, the main estimate is expected to be around 1/3. Note that when the
discretized boundary is used in the density_disc_test
function, setting the option rand_mid_split = TRUE seems to
produce more reliable results.
inside <- ifelse(X[,1] >= 0 & X[,2] >= 0, 1, 0)
df_bndry <- get_closest_points_bbox(cloud, X, inside)
results_gen <- density_disc_test(X, inside, df_bndry[["closest_points"]], in_region0,
in_region1, df_bndry[["lower_bounds0"]],
df_bndry[["lower_bounds1"]], df_bndry[["upper_bounds0"]],
df_bndry[["upper_bounds1"]], verbose = FALSE,
rand_mid_split = TRUE, num_splits = 1, mc_samples = 500)
results_gen
#> Global Density Test Results
#> ----------------------------------
#> Main Estimate : 0.3155
#> Bias-Corrected Estimate : 0.2627
#> Standard Error : 0.1195
#> P-Value : 0.0132

Without the experimental splitting algorithm and with known closest boundary points and bounding box limits, we get:
closest_points <- closest_boundary_point(X)
t_R <- system.time(
results_gen <- density_disc_test(X, inside, closest_points, in_region0,
in_region1, c(-1, -1),
c(0, 0), c(1, 1),
c(1, 1), verbose = FALSE, num_splits = 1, mc_samples = 500)
)
results_gen
#> Global Density Test Results
#> ----------------------------------
#> Main Estimate : 0.3325
#> Bias-Corrected Estimate : 0.3277
#> Standard Error : 0.1160
#> P-Value : 0.0024

The results here tell us that the integral over the absolute value of discontinuities in the density along the boundary is close to the expected 1/3. The p-value indicates that, at a 5% level, we can reject the null hypothesis that there are no discontinuities in the density of the running variables along the boundary.
For a boost in speed, the in_region0 and
in_region1 functions can also be written in C++. Here is a
modified version of the code above with C++ functions instead of R
functions.
library(Rcpp)
cppFunction(
'SEXP region1_to_xptr() {
typedef bool (*region_ptr)(const NumericVector&);
return XPtr<region_ptr>(new region_ptr(in_region1_cpp), true);
}',
includes = '
// C++ version of in_region1
bool in_region1_cpp(const NumericVector& x) {
return (x[0] >= 0.0 && x[1] >= 0.0 &&
x[0] <= 1.0 && x[1] <= 1.0);
}'
)
cppFunction(
'SEXP region0_to_xptr() {
typedef bool (*region_ptr)(const NumericVector&);
return XPtr<region_ptr>(new region_ptr(in_region0_cpp), true);
}',
includes = '
// C++ version of in_region0
bool in_region0_cpp(const NumericVector& x) {
bool not_in_first_quadrant = !(x[0] >= 0.0 && x[1] >= 0.0);
bool within_box = (std::abs(x[0]) <= 1.0 && std::abs(x[1]) <= 1.0);
return not_in_first_quadrant && within_box;
}'
)
ptr1 <- region1_to_xptr()
ptr0 <- region0_to_xptr()
t_cpp <- system.time(
results_gen <- density_disc_test(X, inside, closest_points, ptr0,
ptr1, c(-1, -1),
c(0, 0), c(1, 1),
c(1, 1), verbose = FALSE, num_splits = 1, mc_samples = 500)
)
cat("R Runtime:\n")
#> R Runtime:
print(t_R)
#> user system elapsed
#> 32.47 0.11 32.56
cat("\n")
cat("C++ Runtime:\n")
#> C++ Runtime:
print(t_cpp)
#> user system elapsed
#> 1.80 0.02 1.81