---
title: "Advanced Causal Analysis"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Advanced Causal Analysis}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

# Introduction

This vignette demonstrates advanced usage of the `causaldef` package, focusing on diagnostics for unobserved confounding and sensitivity analysis. We use real datasets included in the package to showcase these features.

```{r setup}
library(causaldef)
library(ggplot2)

# Helper to draw a simple DAG with ggplot2
plot_dag <- function(coords, edges, title = NULL) {
  # Attach start and end coordinates to each edge
  edges_df <- merge(edges, coords, by.x = "from", by.y = "name")
  colnames(edges_df)[c(3, 4)] <- c("x_start", "y_start")
  edges_df <- merge(edges_df, coords, by.x = "to", by.y = "name")
  colnames(edges_df)[c(5, 6)] <- c("x_end", "y_end")

  ggplot2::ggplot(coords, ggplot2::aes(x = x, y = y)) +
    ggplot2::geom_segment(
      data = edges_df,
      ggplot2::aes(x = x_start, y = y_start, xend = x_end, yend = y_end),
      arrow = ggplot2::arrow(length = ggplot2::unit(0.3, "cm"), type = "closed"),
      color = "gray40", linewidth = 1, alpha = 0.8
    ) +
    ggplot2::geom_point(size = 14, color = "white", fill = "#E7B800",
                        shape = 21, stroke = 1.5) +
    ggplot2::geom_text(ggplot2::aes(label = name), fontface = "bold",
                       size = 3, color = "black") +
    ggplot2::ggtitle(title) +
    ggplot2::theme_void(base_size = 14) +
    ggplot2::theme(plot.title = ggplot2::element_text(
      hjust = 0.5, face = "bold", margin = ggplot2::margin(b = 10))) +
    ggplot2::coord_fixed()
}
```

# Unobserved Confounding and Negative Controls

Unobserved confounding is a major threat to causal inference. The `causaldef` package implements **negative control outcomes** to detect residual confounding. A negative control outcome is a variable that is affected by the unobserved confounders but is known not to be causally affected by the treatment.

We will use the `gene_perturbation` dataset provided in the package.
This dataset comes from a CRISPR knockout experiment:

* **Treatment (`knockout_status`)**: Whether a gene was knocked out (Treatment) or not (Control).
* **Outcome (`target_expression`)**: Expression level of the target gene.
* **Confounders**: Technical factors such as `library_size` and `batch`.
* **Negative control (`housekeeping_gene`)**: A housekeeping gene that should *not* be affected by the specific knockout but shares the technical confounders.

## Loading the Data

```{r load_data}
data("gene_perturbation")
head(gene_perturbation)
```

## Running the Negative Control Diagnostic

We first specify the causal model, explicitly identifying the negative control outcome.

```{r nc_spec}
spec_nc <- causal_spec(
  data = gene_perturbation,
  treatment = "knockout_status",
  outcome = "target_expression",
  covariates = c("library_size", "batch"),
  negative_control = "housekeeping_gene"
)

# Visualize the structural assumption
coords <- data.frame(
  name = c("Unobserved", "Treatment", "Target", "Housekp"),
  x = c(0, -1.5, 1.5, 0),
  y = c(1, 0, 0, -1)
)
edges <- data.frame(
  from = c("Unobserved", "Unobserved", "Unobserved", "Treatment"),
  to = c("Treatment", "Target", "Housekp", "Target")
)
plot_dag(coords, edges,
         title = "Why Negative Controls Work:\nShared Unobserved Confounding")
```

Now we run the diagnostic using `nc_diagnostic()`. We test whether inverse probability of treatment weighting (IPTW) successfully removes confounding bias.

```{r nc_run}
set.seed(123)

# Check whether the IPTW adjustment removes confounding,
# as judged by the negative control
nc_res <- nc_diagnostic(spec_nc, method = "iptw", n_boot = 100)
print(nc_res)
```

**Interpreting Negative Control Diagnostics:**

This diagnostic tests a crucial assumption: *"Did my adjustment remove ALL confounding?"*

**Key Results to Check:**

1. **p_value**:
   - $> 0.05$: [PASS] -- no evidence that the treatment affects the negative control after adjustment.
   - $\leq 0.05$: [FAIL] -- the treatment is still associated with the negative control, so residual confounding is detected.
2. **falsified**:
   - `FALSE`: Assumption not falsified; the adjustment appears adequate.
   - `TRUE`: Assumption falsified; likely unobserved confounding or model misspecification.
3. **delta_nc**:
   - The magnitude of the residual association with the negative control.
   - Smaller values indicate better performance; it should be close to zero if the adjustment is successful.
4. **delta_bound**:
   - An upper bound on the true deficiency: $\delta \leq \kappa \times \delta_{NC}$.
   - Provides a worst-case estimate even when you cannot measure $\delta$ directly.

**Scientific Implications:**

The negative control is powerful because:

- It does not rely on assumptions about which confounders exist.
- It provides empirical evidence, not just theoretical arguments.
- Falsification is strong evidence: if the diagnostic fails, your conclusions are suspect.

**If falsified:**

- Do NOT ignore this result.
- Investigate batch effects, technical artifacts, and unmeasured biology.
- Consider additional covariates, different methods, or acknowledging the limitation.

## Addressing the Falsification

If the diagnostic fails (as it might when unobserved batch effects are not fully captured), your current adjustment set is insufficient to block all back-door paths to the negative control, and likely to the outcome as well.

**Recommended Actions:**

1. **Enrich the confounder set**: If possible, measure and include additional potential confounders (e.g., more granular technical variables).
2. **Sensitivity analysis**: Since you now have evidence of unobserved confounding, use `confounding_frontier()` to determine how strong that confounding would need to be to invalidate your results.
3. **Alternative designs**: Consider methods that rely on different assumptions, such as instrumental variables (if available) or difference-in-differences.
4. **Transparent reporting**: If you cannot fix the issue, report the failure: "We detected residual confounding ($\delta_{NC} = ...$) using negative controls. Consequently, the interpretability of the causal estimates is limited."

# Sensitivity Analysis: Confounding Frontiers

When we suspect unobserved confounding, we can quantify how sensitive our causal conclusions are to a potential unmeasured confounder $U$. The `confounding_frontier()` function maps the "deficiency" (information loss) as a function of the strengths of the confounding paths $U \to A$ ($\alpha$) and $U \to Y$ ($\gamma$).

We will demonstrate this using the `hct_outcomes` dataset (hematopoietic cell transplantation).

```{r sensitivity_setup}
data("hct_outcomes")

# Convert the treatment to numeric for the Gaussian model approximation
hct_outcomes$conditioning_numeric <-
  as.numeric(hct_outcomes$conditioning_intensity) - 1

# Simplified specification focusing on conditioning intensity
spec_sens <- causal_spec(
  data = hct_outcomes,
  treatment = "conditioning_numeric",
  outcome = "time_to_event",  # simplified for this example
  covariates = c("age", "disease_status")
)
```

We compute the confounding frontier. This does not require observing $U$; instead, it explores the space of hypothetical confounding strengths.

```{r run_frontier}
frontier <- confounding_frontier(
  spec_sens,
  alpha_range = c(-1, 1),
  gamma_range = c(-1, 1),
  grid_size = 40
)

# The result contains a grid we can visualize
head(frontier$grid)
```

## Visualizing the Frontier

We can visualize the regions where the deficiency is high (problematic) versus low.
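Before plotting, a quick numerical summary of the grid can help. The sketch below assumes only the `alpha`, `gamma`, and `delta` columns shown by `head(frontier$grid)` above; the 0.3 cutoff is an illustrative "weak confounding" region, and 0.05 is the low-deficiency threshold used in the interpretation that follows.

```{r frontier_summary}
# Share of the explored (alpha, gamma) grid with low deficiency (delta < 0.05)
mean(frontier$grid$delta < 0.05)

# Worst-case deficiency within a hypothetical "weak confounding" region,
# here |alpha| < 0.3 and |gamma| < 0.3
weak <- subset(frontier$grid, abs(alpha) < 0.3 & abs(gamma) < 0.3)
max(weak$delta)
```

If the second value is small, identification is robust across every confounding scenario you consider plausible, not just at a single point.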
```{r plot_frontier, fig.width=6, fig.height=5}
# direction = 1 so that low delta is dark and high delta is bright,
# matching the interpretation below
ggplot(frontier$grid, aes(x = alpha, y = gamma, fill = delta)) +
  geom_tile() +
  geom_contour(aes(z = delta), color = "white", breaks = c(0.01, 0.05, 0.1)) +
  scale_fill_viridis_c(option = "magma", direction = 1) +
  labs(
    title = "Confounding Frontier",
    x = "Confounding Strength U -> A (alpha)",
    y = "Confounding Strength U -> Y (gamma)",
    fill = "Deficiency\n(Delta)"
  ) +
  theme_minimal()
```

**Interpreting the Confounding Frontier:**

This visualization maps the "danger zones" of unobserved confounding.

**The Axes:**

- **x-axis ($\alpha$)**: Strength of the confounding path $U \to A$ (the unobserved confounder affects the treatment).
- **y-axis ($\gamma$)**: Strength of the confounding path $U \to Y$ (the unobserved confounder affects the outcome).
- **Fill ($\delta$)**: The resulting deficiency (information loss).

**Reading the Plot:**

1. **Dark regions (low $\delta \approx 0$)**:
   - "Safe zones" where causal identification is robust.
   - Small amounts of unobserved confounding do not destroy identification.
   - You can proceed with confidence in these scenarios.
2. **Bright regions (high $\delta > 0.15$)**:
   - "Danger zones" where unobserved confounding severely compromises identification.
   - Large distributional distance from the ideal experiment (TV distance $> 0.15$).
   - Causal conclusions are unreliable without additional assumptions.
3. **Contour lines** (white lines at $\delta = 0.01, 0.05, 0.1$):
   - Boundaries between different risk levels.
   - The "frontier" where identification breaks down.

**Scientific Decision-Making:**

Use domain knowledge to assess plausibility:

- **If you believe** $|\alpha| < 0.3$ and $|\gamma| < 0.3$ (weak confounding), and the plot shows low $\delta$ in that region, proceed with confidence.
- **If plausible scenarios** put you in high-$\delta$ regions, your conclusions are fragile; acknowledge this limitation.
- **If you are uncertain** about the confounding strength, report results across the frontier to show the sensitivity.

**Example Interpretation:**

*"Our sensitivity analysis shows that if the unobserved confounding paths $U \to A$ and $U \to Y$ are both weak ($|\alpha|, |\gamma| < 0.2$), the deficiency remains below 0.05, suggesting robust identification. However, if either confounding strength exceeds 0.5, identification breaks down ($\delta > 0.1$). Given the biological plausibility of weak confounding in this system, we conclude that our estimates are reasonably robust."*

The confounding frontier shows the boundary: if we are outside the "safe" zone it defines, we cannot claim causal identification without further assumptions (see Akdemir 2026, DOI: 10.5281/zenodo.18367347, for theoretical details).

# Conclusion

The `causaldef` package provides tools not just for estimation, but for **falsification** and **sensitivity analysis**:

1. **Negative controls** let us empirically test the "no unobserved confounding" assumption using auxiliary data.
2. **Confounding frontiers** let us theoretically explore how robust a study is to potential unobserved confounders.