---
title: "Generalizability Path Example: Characterizing Underrepresented Populations"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Generalizability Path Example: Characterizing Underrepresented Populations}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

## Overview

A common challenge in translating evidence from randomized controlled trials (RCTs) to real-world practice is that trial participants may not reflect the broader target population. In the terminology of Parikh et al. (2025), "underrepresented" or "insufficiently represented" subgroups occupy regions of the covariate space where treatment effects are heterogeneous and the trial contains too few participants. When such subgroups are underrepresented in the trial, estimates of the **Target Average Treatment Effect (TATE)** can be imprecise or misleading when transported to the target population. The **Sample Average Treatment Effect (SATE)** is the finite-sample analogue of the TATE. ROOT (Rashomon Set of Optimal Trees) instead targets the **Weighted Target Average Treatment Effect (WTATE)**: the average treatment effect restricted to the sufficiently represented subpopulation, which is estimated with lower variance than the unweighted TATE.

This vignette walks through a complete generalizability analysis using the built-in `diabetes_data` dataset.

---

## The `diabetes_data` Dataset

`diabetes_data` is a simulated dataset that mimics a diabetes intervention study. It contains 2,000 individuals in a randomized controlled trial (RCT) sample and 8,000 individuals in the simulated target population to which we want to generalize.

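
In terms of this dataset, with $S = 1$ marking trial units and $S = 0$ marking target-population units, the estimands named in the Overview can be sketched in potential-outcome notation. The symbols $Y(1)$, $Y(0)$, and the binary weight function $w(X)$ are notation assumed here for illustration; Parikh et al. (2025) give the precise definitions and identification conditions.

$$
\text{TATE} = \mathbb{E}\left[\,Y(1) - Y(0) \mid S = 0\,\right],
\qquad
\text{WTATE} = \frac{\mathbb{E}\left[\,w(X)\,\{Y(1) - Y(0)\} \mid S = 0\,\right]}{\mathbb{E}\left[\,w(X) \mid S = 0\,\right]},
\qquad w(X) \in \{0, 1\}.
$$

Under this notation, setting $w(X) = 1$ for every unit makes the WTATE coincide with the TATE, so the weights only change the estimand by excluding the regions flagged as underrepresented.
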
```{r load-data}
library(ROOT)
data(diabetes_data, package = "ROOT")
str(diabetes_data)
```

The key columns are:

| Column       | Description                                       |
|:-------------|:--------------------------------------------------|
| `Y`          | Observed outcome (numeric)                        |
| `Tr`         | Treatment assignment (0 = control, 1 = treated)   |
| `S`          | Sample indicator (1 = RCT, 0 = target population) |
| `Age45`      | Age ≥ 45 (binary indicator)                       |
| `DietYes`    | Currently on a diet programme (binary indicator)  |
| `Race_Black` | Race: Black (binary indicator)                    |
| `Sex_Male`   | Sex: Male (binary indicator)                      |

```{r explore-data}
# How many trial vs target population units?
table(S = diabetes_data$S)

# Treatment breakdown within the trial
table(Tr = diabetes_data$Tr[diabetes_data$S == 1])
```

---

## Checking Covariate Overlap

Before running ROOT, it is good practice to check whether trial participants differ from the target population on key covariates. Systematic differences signal which subgroups may be underrepresented.

```{r overlap}
# Mean of each covariate by S
covariate_cols <- c("Age45", "DietYes", "Race_Black", "Sex_Male")

overlap <- sapply(covariate_cols, function(v) {
  tapply(diabetes_data[[v]], diabetes_data$S, mean, na.rm = TRUE)
})

# Transpose so that each row is a covariate and each column is a value of S
knitr::kable(
  t(overlap),
  digits = 3,
  caption = "Covariate means by sample membership (S = 1: trial, S = 0: target)"
)
```

Each row of the table is a covariate and each column a value of `S`. Rows where the two columns differ noticeably flag potential sources of underrepresentation that ROOT will attempt to characterize.

---

## Fitting ROOT in Generalizability Mode

We use `characterizing_underrep()`, the high-level wrapper around `ROOT()` for generalizability/transportability analyses. It expects `data` to contain `Y`, `Tr`, and `S`, and internally:

1. Estimates transportability scores using logistic regression models (default) for $P(S = 1 \mid X)$ and $P(\text{Tr} = 1 \mid X, S = 1)$.
2. Constructs Horvitz–Thompson-style influence scores $v_i$.
3. Grows a forest of weighted trees that minimize the variance of the weighted estimator $\widehat{\text{WTATE}}$.
4. Selects a Rashomon set of the top-$k$ trees and aggregates their weight assignments by majority vote (default).
5. Fits a single summary tree characterizing the final $w_{\text{opt}}$ assignments.

```{r fit, message = FALSE, warning = FALSE}
gen_fit <- characterizing_underrep(
  data = diabetes_data,
  generalizability_path = TRUE,
  num_trees = 20,
  top_k_trees = TRUE,
  k = 10,
  seed = 123
)
```

---

## Inspecting the Results

### Print summary

```{r print}
print(gen_fit)
```

### Detailed summary

`summary()` additionally reports the Rashomon set size, the percentage of observations with $w_{\text{opt}} = 1$, and the unweighted and weighted estimands with their standard errors.

```{r summary}
summary(gen_fit)
```

The **SATE** (unweighted) is the simple trial average treatment effect transported to the full target population. The **WTATE** (weighted) restricts this estimate to the well-represented subpopulation, where the trial provides more reliable evidence. A smaller standard error (SE) for the WTATE relative to the SATE reflects the variance reduction achieved by this restriction.

### Terminal node rules

The `leaf_summary` component of the returned object gives an explicit human-readable rule for each terminal node of the summary tree, along with the number and percentage of observations in each leaf and whether they are classified as represented ($w = 1$) or underrepresented ($w = 0$).

```{r leaf-summary}
gen_fit$leaf_summary
```

---

## Visualizing the Characterization Tree

`plot()` renders the final characterized tree from the Rashomon set. Blue leaves ($w = 1$) denote well-represented subgroups; orange leaves ($w = 0$) denote underrepresented subgroups. The percentage shown in each leaf is the share of trial units falling into that node.

```{r plot, fig.width = 7, fig.height = 5, fig.alt = "Characterized tree for diabetes generalizability analysis"}
plot(gen_fit)
```

The tree reads top-down as a decision rule: starting from the root (all trial units), the first split separates subgroups that are wholly underrepresented from those that may be included. Follow the branches down to each leaf to read the complete inclusion/exclusion rule for that subgroup.

---

## Interpreting the Output

From the characterized tree and leaf summary, we can describe the underrepresented subgroups in plain language:

- **Black participants** are flagged as underrepresented ($w = 0$) regardless of other characteristics.
- **Participants aged 45 or older who are not Black** are also underrepresented.
- **Participants on a diet programme who are neither Black nor aged 45+** are underrepresented.
- The remaining participants, those who are not Black, under 45, and not on a diet programme, form the **well-represented subpopulation** ($w = 1$) for whom the WTATE is estimated.

The Rashomon set provides multiple near-optimal characterizations of these subgroups. The final summary tree aggregates across all trees in the set, giving a single interpretable rule.

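
The plain-language rules above collapse into a single logical condition on three covariates. The chunk below is a minimal sketch of that condition, not output from ROOT: `toy` is a made-up data frame that only borrows the vignette's column names, and the rule is transcribed by hand from the bullet list, so it would need to be re-derived from `gen_fit$leaf_summary` if the fitted tree changes.

```{r leaf-rule-sketch}
# Hypothetical toy data using the vignette's column names (values are invented)
toy <- data.frame(
  Race_Black = c(1, 0, 0, 0, 1),
  Age45      = c(0, 1, 0, 0, 1),
  DietYes    = c(0, 0, 1, 0, 1)
)

# w = 1 only for units that are not Black, under 45, and not on a diet programme;
# every other combination falls into one of the three underrepresented groups
toy$w <- as.integer(
  toy$Race_Black == 0 & toy$Age45 == 0 & toy$DietYes == 0
)

toy

# Share of toy units assigned to the well-represented subpopulation
mean(toy$w)
```

Writing the rule this way makes it easy to check how many units of any new dataset with the same columns would fall inside the subpopulation to which the WTATE applies.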

---

## Key Parameters

| Parameter        | Role                                                                                                | Default      |
|:-----------------|:----------------------------------------------------------------------------------------------------|:-------------|
| `num_trees`      | Number of trees to grow in the forest                                                               | `10`         |
| `top_k_trees`    | If `TRUE`, select the top `k` trees by objective value                                              | `FALSE`      |
| `k`              | Rashomon set size when `top_k_trees = TRUE`                                                         | `10`         |
| `cutoff`         | Rashomon threshold when `top_k_trees = FALSE`; `"baseline"` uses the objective at $w \equiv 1$      | `"baseline"` |
| `vote_threshold` | Fraction of Rashomon-set trees that must vote $w = 1$ for a unit to be included                     | `2/3`        |
| `seed`           | Random seed for reproducibility                                                                     | `NULL`       |
| `feature_est`    | Feature importance method used to bias split selection (`"Ridge"`, `"GBM"`, or a custom function)   | `"Ridge"`    |
| `leaf_proba`     | Controls tree depth by increasing the probability of stopping at a leaf                             | `0.25`       |

---

## Reference

Parikh, H., Ross, R. K., Stuart, E., & Rudolph, K. E. (2025). Who Are We Missing?: A Principled Approach to Characterizing the Underrepresented Population. *Journal of the American Statistical Association*, 120(551), 1414–1423.