---
title: "Cross-Fitting for Debiased Kernel Estimation"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Cross-Fitting for Debiased Kernel Estimation}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

## The overfitting problem

The forest balance estimator uses the data twice: once to fit the random forest that defines the kernel, and again to estimate the treatment effect using that kernel. This creates a subtle **overfitting bias** that persists even at large sample sizes.

To see this, we compare the standard (no cross-fitting) estimator with small leaf size against the cross-fitted estimator with adaptive leaf size, and against an oracle that uses the true propensity scores. We use $n = 5{,}000$ and $p = 50$ covariates:

``` r
library(forestBalance)
set.seed(123)

nreps <- 2
res <- matrix(NA, nreps, 3,
              dimnames = list(NULL, c("No CF (mns=10)", "CF (default)", "Oracle IPW")))

for (r in seq_len(nreps)) {
  dat <- simulate_data(n = 5000, p = 50, ate = 0)

  # Standard: no cross-fitting, small min.node.size
  fit_nocf <- forest_balance(dat$X, dat$A, dat$Y,
                             cross.fitting = FALSE, min.node.size = 10,
                             num.trees = 500)
  res[r, "No CF (mns=10)"] <- fit_nocf$ate

  # Cross-fitted with adaptive leaf size (package default)
  fit_cf <- forest_balance(dat$X, dat$A, dat$Y, num.trees = 500)
  res[r, "CF (default)"] <- fit_cf$ate

  # Oracle IPW (true propensity scores)
  ps <- dat$propensity
  w_ipw <- ifelse(dat$A == 1, 1 / ps, 1 / (1 - ps))
  res[r, "Oracle IPW"] <- weighted.mean(dat$Y[dat$A == 1], w_ipw[dat$A == 1]) -
    weighted.mean(dat$Y[dat$A == 0], w_ipw[dat$A == 0])
}
```

Table: n = 5,000, p = 50, true ATE = 0, 2 replications.

|Method         |   Bias|     SD|   RMSE|
|:--------------|------:|------:|------:|
|No CF (mns=10) | 0.1912| 0.0288| 0.1934|
|CF (default)   | 0.0021| 0.0314| 0.0315|
|Oracle IPW     | 0.0228| 0.0149| 0.0272|

The no-cross-fitting estimator with small leaves has substantial bias.
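For context, the oracle arm above is a Hajek-style weighted difference of means. It can be written as a small base-R helper (a sketch for illustration; `hajek_diff` is our name, not part of the forestBalance API):

``` r
# Hajek-style weighted difference of group means, as in the oracle IPW arm.
# (Illustrative helper; not part of the forestBalance API.)
hajek_diff <- function(Y, A, w) {
  weighted.mean(Y[A == 1], w[A == 1]) - weighted.mean(Y[A == 0], w[A == 0])
}

# Toy check: with equal weights this reduces to the difference in group means.
Y <- c(3, 1, 2, 0)
A <- c(1, 1, 0, 0)
w <- rep(2, 4)
hajek_diff(Y, A, w)  # (3 + 1)/2 - (2 + 0)/2 = 1
```

The cross-fitted fold-level estimator takes the same form, with balancing weights in place of inverse propensities.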
The default cross-fitted estimator with adaptive leaf size achieves much lower bias and RMSE.

## Cross-fitting details

The idea follows the **double/debiased machine learning** (DML) framework of Chernozhukov et al. (2018), adapted to kernel energy balancing.

Let $\mathbf{K}$ denote the $n \times n$ proximity kernel built from a random forest trained on the full sample $(X, A, Y)$. The kernel captures which observations are "similar" in terms of confounding structure. However, because $\mathbf{K}$ was estimated from the same data used to compute the ATE, the kernel overfits: it encodes information about the specific realisation of outcomes, not just the confounding structure. This creates a bias that does not vanish as $n \to \infty$.

### K-fold cross-fitting

Given $K$ folds, the cross-fitted estimator proceeds as follows:

1. **Randomly partition** the data into $K$ roughly equal folds $F_1, \ldots, F_K$.
2. **For each fold** $k = 1, \ldots, K$:

    a. Train a `multi_regression_forest` on the data in folds $F_{-k}$ (all folds except $k$).
    b. Using this held-out forest, predict leaf node memberships for the observations in fold $F_k$.
    c. Build the proximity kernel $\mathbf{K}_k$ ($n_k \times n_k$) from these out-of-sample leaf predictions.
    d. Compute kernel energy balancing weights $w_k$ for the observations in fold $F_k$ using $\mathbf{K}_k$.
    e. Estimate the fold-level ATE via the Hajek estimator:
    $$\hat\tau_k = \frac{\sum_{i \in F_k} w_i A_i Y_i}{\sum_{i \in F_k} w_i A_i} - \frac{\sum_{i \in F_k} w_i (1-A_i) Y_i}{\sum_{i \in F_k} w_i (1-A_i)}.$$

3. **Average** the fold-level estimates (DML1):
$$\hat\tau_{\mathrm{CF}} = \frac{1}{K} \sum_{k=1}^{K} \hat\tau_k.$$

## The role of leaf size

Cross-fitting alone is not sufficient to eliminate bias. The **minimum leaf size** (`min.node.size`) plays a crucial role:

- **Small leaves** (e.g., `min.node.size = 5`): the kernel is very granular, distinguishing observations at a fine scale. But with out-of-sample predictions, small leaves lead to noisy similarity estimates---two observations may share a tiny leaf by chance rather than through true similarity.
- **Large leaves** (e.g., `min.node.size = 50--100`): the kernel captures broader confounding structure. Out-of-sample predictions are more stable because large leaves better represent population-level similarity.

The optimal leaf size scales with both $n$ (more data supports finer leaves) and $p$ (more covariates require coarser leaves to avoid the curse of dimensionality). `forestBalance` uses an adaptive heuristic:

$$\mathrm{min.node.size} = \max\!\Big(20,\;\min\!\big(\lfloor n/200 \rfloor + p,\;\lfloor n/50 \rfloor\big)\Big).$$

``` r
set.seed(123)
nreps <- 2
node_sizes <- c(5, 10, 20, 50, 75, 100)
n <- 5000; p <- 50

leaf_res <- do.call(rbind, lapply(node_sizes, function(mns) {
  ates <- replicate(nreps, {
    dat <- simulate_data(n = n, p = p, ate = 0)
    forest_balance(dat$X, dat$A, dat$Y,
                   num.trees = 500, min.node.size = mns)$ate
  })
  data.frame(mns = mns,
             bias = mean(ates),
             sd = sd(ates),
             rmse = sqrt(mean(ates)^2 + var(ates)))
}))

heuristic <- max(20, min(floor(n / 200) + p, floor(n / 50)))
```

Table: Cross-fitted estimator (n = 5,000, p = 50, 2 reps). Arrow marks the adaptive default (mns = 75).

| min.node.size|    Bias|     SD|   RMSE|    |
|-------------:|-------:|------:|------:|:---|
|             5|  0.1716| 0.0134| 0.1721|    |
|            10|  0.1017| 0.0015| 0.1017|    |
|            20|  0.0153| 0.0447| 0.0472|    |
|            50|  0.0481| 0.0063| 0.0485|    |
|            75| -0.0099| 0.0351| 0.0364|<-- |
|           100|  0.0332| 0.0007| 0.0332|    |

Bias decreases with larger leaves until variance begins to dominate. The adaptive default balances bias reduction against variance.
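As a sanity check, the heuristic can be evaluated directly for a few designs (a base-R sketch; `adaptive_mns` is an illustrative name, not a forestBalance export):

``` r
# Adaptive minimum leaf size: max(20, min(floor(n/200) + p, floor(n/50))).
# (Illustrative re-implementation of the heuristic in the text.)
adaptive_mns <- function(n, p) {
  max(20, min(floor(n / 200) + p, floor(n / 50)))
}

adaptive_mns(5000, 50)  # min(25 + 50, 100) = 75 -- the default used above
adaptive_mns(5000, 10)  # min(25 + 10, 100) = 35 -- finer leaves when p is small
adaptive_mns(1000, 50)  # min(5 + 50, 20) = 20  -- the n/50 cap binds
```

For $n = 5{,}000$ and $p = 50$ this gives 75, matching the arrow in the table above.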
## Practical usage

The default call uses cross-fitting with the adaptive leaf size:

``` r
dat <- simulate_data(n = 2000, p = 10, ate = 0)
fit <- forest_balance(dat$X, dat$A, dat$Y)
fit
#> Forest Kernel Energy Balancing (cross-fitted)
#> --------------------------------------------------
#> n = 2,000 (n_treated = 745, n_control = 1255)
#> Trees: 1000
#> Cross-fitting: 2 folds
#> Solver: direct
#> ATE estimate: -0.0315
#> Fold ATEs: -0.0808, 0.0177
#> ESS: treated = 529/745 (71%)  control = 878/1255 (70%)
#> --------------------------------------------------
#> Use summary() for covariate balance details.
```

To disable cross-fitting (e.g., for speed or to inspect the kernel):

``` r
fit_nocf <- forest_balance(dat$X, dat$A, dat$Y, cross.fitting = FALSE)
fit_nocf
#> Forest Kernel Energy Balancing
#> --------------------------------------------------
#> n = 2,000 (n_treated = 745, n_control = 1255)
#> Trees: 1000
#> Solver: direct
#> ATE estimate: -0.0215
#> ESS: treated = 459/745 (62%)  control = 759/1255 (60%)
#> --------------------------------------------------
#> Use summary() for covariate balance details.
```

To set the leaf size manually:

``` r
fit_custom <- forest_balance(dat$X, dat$A, dat$Y, min.node.size = 50)
fit_custom
#> Forest Kernel Energy Balancing (cross-fitted)
#> --------------------------------------------------
#> n = 2,000 (n_treated = 745, n_control = 1255)
#> Trees: 1000
#> Cross-fitting: 2 folds
#> Solver: direct
#> ATE estimate: -0.0784
#> Fold ATEs: -0.0918, -0.065
#> ESS: treated = 424/745 (57%)  control = 739/1255 (59%)
#> --------------------------------------------------
#> Use summary() for covariate balance details.
```

## Choosing the number of folds

The default is `num.folds = 2` (sample splitting). With two folds, each fold retains half the observations, producing high-quality per-fold kernels. More folds train the forest on more data but produce smaller per-fold kernels.
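The random partition into folds amounts to a few lines of base R; this sketch mirrors the `num.folds` argument name but is not the package's internal code:

``` r
set.seed(1)
n <- 11; num.folds <- 3

# Random partition into roughly equal folds:
# repeat 1..K to length n, then shuffle the assignments.
fold_id <- sample(rep(seq_len(num.folds), length.out = n))
table(fold_id)  # fold sizes differ by at most one (here 4, 4, 3)
```

Because `rep(..., length.out = n)` fixes the fold sizes before shuffling, the sizes always differ by at most one regardless of the seed.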
Our experiments show that the number of folds has a modest effect compared to the leaf size. Values of 2--5 perform comparably, while many folds (here 10) degrade performance because the per-fold kernels become too small:

``` r
set.seed(123)
nreps <- 2
n <- 5000

fold_res <- do.call(rbind, lapply(c(2, 3, 5, 10), function(nfolds) {
  ates <- replicate(nreps, {
    dat <- simulate_data(n = n, p = 10, ate = 0)
    forest_balance(dat$X, dat$A, dat$Y,
                   num.trees = 500, num.folds = nfolds)$ate
  })
  data.frame(folds = nfolds,
             bias = round(mean(ates), 4),
             sd = round(sd(ates), 4),
             rmse = round(sqrt(mean(ates)^2 + var(ates)), 4))
}))
```

Table: Effect of number of folds (n = 5,000, adaptive leaf size).

| Folds|   Bias|     SD|   RMSE|
|-----:|------:|------:|------:|
|     2| 0.0907| 0.0456| 0.1015|
|     3| 0.0246| 0.0275| 0.0369|
|     5| 0.1025| 0.0096| 0.1030|
|    10| 0.2065| 0.1269| 0.2424|

## References

Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W. and Robins, J. (2018). Double/debiased machine learning for treatment and structural parameters. *The Econometrics Journal*, 21(1), C1--C68.

De, S. and Huling, J.D. (2025). Data adaptive covariate balancing for causal effect estimation for high dimensional data. *arXiv preprint arXiv:2512.18069*.