---
title: "Automated Data Auditing for Causal Studies"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Automated Data Auditing for Causal Studies}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 7,
  fig.height = 5
)
```

Before diving into causal estimation, a critical but often overlooked step is **data auditing**: systematically checking which variables in your dataset might introduce bias, confounding, or estimation problems. The `audit_data()` function in **causaldef** automates this process by testing each variable against the treatment and outcome to classify its causal role.

## Why Audit Your Data?

Traditional exploratory data analysis (EDA) tools check for:

- Missing values
- Distributional skew
- Outliers

But they miss **causal validity** issues:

- Which variables are confounders that MUST be adjusted for?
- Which variables are potential instruments?
- Which variables might serve as negative controls?
- Which variables are "leaky" and could bias your analysis?

Based on the manuscript's negative control certification/bounding logic (`thm:nc_bound`), `audit_data()` systematically evaluates each variable.

## Case Study: Right Heart Catheterization (RHC)

We'll demonstrate data auditing using the classic **RHC dataset** from Connors et al. (1996). This dataset contains 5,735 critically ill patients from 5 medical centers, with the treatment being Right Heart Catheterization (RHC) and the outcome being 30-day mortality.
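Before running the audit on RHC, it helps to see the classification rule in miniature. The sketch below is a simplified illustration of the kind of check `audit_data()` automates: associate each covariate with the treatment and with the outcome, then classify it by which associations clear the significance threshold. Note that `classify_variable()` is a hypothetical helper written for this vignette, not part of **causaldef**, and the package's actual tests may differ.

```{r classify-sketch}
# Hypothetical sketch of the classification rule (not the causaldef internals):
# a variable's role is determined by which associations are significant.
classify_variable <- function(x, treatment, outcome, alpha = 0.05) {
  p_trt <- cor.test(x, treatment)$p.value  # association with treatment
  p_out <- cor.test(x, outcome)$p.value    # association with outcome
  if (p_trt < alpha && p_out < alpha) {
    "Confounder"            # linked to both: must adjust
  } else if (p_trt < alpha) {
    "Potential Instrument"  # linked to treatment only
  } else if (p_out < alpha) {
    "Outcome Predictor"     # linked to outcome only
  } else {
    "Safe"                  # linked to neither
  }
}
```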
This is an ideal case study because:

- **Medium size**: n = 5,735 patients
- **Many covariates**: p = 63 variables
- **Real confounding concerns**: RHC is not randomly assigned
- **Clinical importance**: Used extensively in the causal inference literature

```{r setup}
library(causaldef)

# Load the RHC dataset
data(rhc)
cat("Dataset dimensions:", nrow(rhc), "patients,", ncol(rhc), "variables\n")
```

### Understanding the Research Question

**Treatment**: `swang1` - Whether the patient received Right Heart Catheterization (1 = Yes, 0 = No)

**Outcome**: `death` - 30-day mortality (Yes/No)

**Covariates**: Demographics, disease category, vital signs, lab values, comorbidities, etc.

```{r explore}
# Key variables
cat("Treatment distribution (swang1 = RHC):\n")
table(rhc$swang1)

cat("\nOutcome distribution (death):\n")
table(rhc$death)
```

## Running the Data Audit

Let's audit all available variables to understand which ones are most relevant for causal analysis.

```{r audit}
# Prepare data - convert death to numeric for auditing
rhc_clean <- rhc
rhc_clean$death_num <- as.numeric(rhc_clean$death == "Yes")

# Select relevant numeric and factor columns for the audit
# (excluding IDs, dates, and the outcome/treatment themselves)
exclude_cols <- c("X", "ptid", "sadmdte", "dschdte", "dthdte", "lstctdte",
                  "swang1", "death", "death_num")
audit_cols <- setdiff(names(rhc_clean), exclude_cols)

# Run the audit
report <- audit_data(
  data = rhc_clean,
  treatment = "swang1",
  outcome = "death_num",
  covariates = audit_cols[1:25],  # First 25 covariates for demonstration
  alpha = 0.01,                   # Stricter significance level
  verbose = FALSE
)

print(report)
```

### Interpreting the Report

The audit classifies each variable into one of these categories:

| Classification | Meaning | Action |
|----------------|---------|--------|
| **Confounder** | Correlates with BOTH treatment and outcome | MUST adjust for this |
| **Potential Instrument** | Correlates with treatment but NOT outcome | Consider for IV analysis |
| **Outcome Predictor** | Correlates with outcome but NOT treatment | Include for precision |
| **Safe** | No significant correlations | Can include or exclude |

## Examining Detected Issues

Let's look more closely at the flagged confounders:

```{r confounders}
# Filter to see only the confounders
confounders <- report$issues[report$issues$issue_type == "Confounder", ]

if (nrow(confounders) > 0) {
  cat("Detected Confounders (must adjust for these):\n\n")
  print(confounders[, c("variable", "r_treatment", "r_outcome", "p_value")])
}
```

These variables show significant correlation with both the treatment decision (whether a patient receives RHC) and the outcome (mortality). Failing to adjust for them would introduce **confounding bias**.

## Clinical Interpretation

The audit results make clinical sense:

1. **Disease severity indicators** (like the APACHE score `aps1` and vital signs) should correlate with both:
   - Sicker patients are more likely to receive RHC (treatment)
   - Sicker patients are more likely to die (outcome)
2. **Demographics** (age, comorbidities) follow similar patterns
3. **Some variables** correlate only with the outcome (predictors of mortality but not of treatment selection)

## Comparing Audit Results Across Subsets

Let's see how the audit differs across patient subgroups:

```{r subgroup}
# Audit cardiac patients only
cardiac_patients <- rhc_clean[rhc_clean$card == 1, ]

if (nrow(cardiac_patients) > 50) {
  report_cardiac <- audit_data(
    data = cardiac_patients,
    treatment = "swang1",
    outcome = "death_num",
    covariates = audit_cols[1:15],
    alpha = 0.01,
    verbose = FALSE
  )

  cat("=== Cardiac Patients Subgroup ===\n")
  cat("Sample size:", nrow(cardiac_patients), "\n")
  cat("Issues found:", report_cardiac$summary_stats$n_issues, "\n")
  cat("Confounders:", report_cardiac$summary_stats$n_confounders, "\n")
}
```

## Using Audit Results for Causal Analysis

Once you've identified confounders through the audit, use them in your causal specification:

```{r spec}
# Get the list of detected confounders
confounder_vars <- report$issues$variable[report$issues$issue_type == "Confounder"]

# If we have confounders, build a proper causal specification
if (length(confounder_vars) > 0) {
  spec <- causal_spec(
    data = rhc_clean,
    treatment = "swang1",
    outcome = "death_num",
    covariates = confounder_vars
  )
  print(spec)
}
```

## Full Audit Summary

```{r summary}
# Summary statistics from the audit
cat("\n=== Audit Summary ===\n")
cat("Variables audited:", report$summary_stats$n_vars_audited, "\n")
cat("Total issues:", report$summary_stats$n_issues, "\n")
cat("  - Confounders:", report$summary_stats$n_confounders, "\n")
cat("  - Potential instruments:", report$summary_stats$n_instruments, "\n")
```

## Best Practices for Data Auditing

1. **Run the audit early**: Before any causal analysis, audit your data to understand variable roles
2. **Use domain knowledge**: The audit identifies statistical associations; combine it with clinical/domain expertise
3. **Adjust the significance level**:
   - Use a stricter `alpha` (e.g., 0.01) for larger datasets to reduce false positives
   - Use a looser `alpha` (e.g., 0.10) for smaller samples to catch potential confounders
4. **Audit subgroups**: Confounding patterns may differ across patient populations
5. **Document decisions**: Record which variables you adjust for and why
6. **Iterate**: After the initial analysis, re-audit to check whether additional variables should be included

## Conclusion

The `audit_data()` function provides an automated first pass at identifying causal structure in your dataset. It answers key questions:

- **What must I adjust for?** → Confounders
- **What might help with identification?** → Potential instruments
- **What improves precision?** → Outcome predictors
- **What's safe to ignore?** → Variables with no causal role

This systematic approach helps ensure your causal analysis is built on a solid foundation, reducing the risk of confounding bias and improving the reliability of your conclusions.

## References

- Connors, A.F. et al. (1996). The Effectiveness of Right Heart Catheterization in the Initial Care of Critically Ill Patients. *JAMA*, 276(11), 889-897.
- Akdemir, D. (2026). Constraints on Causal Inference as Experiment Comparison. Negative control certification/bounding (`thm:nc_bound`).