--- title: "Getting Started with WMAP" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{WMAP} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include=FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 6, fig.height = 4 ) library(WMAP) ``` # Introduction WMAP (Weighted Multi-site Analysis Package) integrates evidence from multiple observational studies with different group compositions using balancing weights. # Installation ```{r install, eval=FALSE} # Install from CRAN install.packages("WMAP") # Load package library(WMAP) ``` # Quick Start Example ## Load Demo Data ```{r load-data} # Load demonstration dataset data(demo) # The demo data contains: # S - Study/site indicators (7 hospitals) # Z - Group indicators (2 groups: IDC vs ILC) # X - Patient covariates (clinical variables) # Y - Outcomes (8 gene expression measurements) # naturalGroupProp - Known population group proportions ``` ## Run Analysis ```{r basic-analysis} # Estimate causal effects result <- causal.estimate( S = S, # Study indicators Z = Z, # Group indicators X = X, # Covariates Y = Y, # Outcomes B = 20, # Bootstrap samples (use 100+ in practice) method = "IC", # Weighting method naturalGroupProp = naturalGroupProp, seed = 123 # For reproducibility ) ``` ## View Results ```{r results} # Print summary summary(result) # Visualize bootstrap ESS distribution plot(result) ``` # Understanding the Output **Effective Sample Size (ESS):** The **Percent ESS** indicates how much information remains after weighting: **Mean Differences:** Shows estimated differences between Group 1 and Group 2 with 95% confidence intervals. **SD Ratios:** Ratios of standard deviations between groups. # Advanced Features ## Choosing a Weighting Method WMAP provides three methods: - **IC (Integrative Combined)** - **IGO (Integrative Generalized Overlap)** - **FLEXOR (Flexible Overlap)** ```{r methods, eval=FALSE} # Compare methods result_ic <- causal.estimate(S, Z, X, Y, B = 100, method = "IC", naturalGroupProp) result_igo <- causal.estimate(S, Z, X, Y, B = 100, method = "IGO", naturalGroupProp) result_flexor <- causal.estimate(S, Z, X, Y, B = 100, method = "FLEXOR", naturalGroupProp) # Compare ESS result_ic$percentESS result_igo$percentESS result_flexor$percentESS ``` ## Outcome Model By default, `causal.estimate` uses observed outcomes directly (`outcome_model = FALSE`). Set `outcome_model = TRUE` to fit a random forest on X, S, Z and substitute predicted outcomes for observed Y in all moment calculations (mean, SD, median). ```{r outcome-model, eval=FALSE} # Default: use observed outcomes directly result <- causal.estimate(S, Z, X, Y, B = 100, method = "IC", naturalGroupProp = naturalGroupProp) # RF predictions substituted for observed Y result <- causal.estimate(S, Z, X, Y, B = 100, method = "IC", naturalGroupProp = naturalGroupProp, outcome_model = TRUE) ``` When `outcome_model = TRUE`, a random forest is fit per outcome per bootstrap iteration, which increases runtime. ### Cross-fitting By default (`crossfit_folds = NULL`), the random forest is fit on the full sample. Set `crossfit_folds` to an integer >= 2 for K-fold cross-fitting: ```{r crossfit, eval=FALSE} # Enable 5-fold cross-fitting result <- causal.estimate(S, Z, X, Y, B = 100, method = "IC", naturalGroupProp = naturalGroupProp, outcome_model = TRUE, crossfit_folds = 5) ``` ## Bootstrap Options ### Adjust Bootstrap Sample Size ```{r bootstrap-size, eval=FALSE} # Minimum for stable CIs result <- causal.estimate(S, Z, X, Y, B = 50, method = "IC", naturalGroupProp) # Recommended for publication result <- causal.estimate(S, Z, X, Y, B = 200, method = "IC", naturalGroupProp) # High precision result <- causal.estimate(S, Z, X, Y, B = 500, method = "IC", naturalGroupProp) ``` **Trade-off:** More bootstrap samples = better precision but longer computation time ### Fast Point Estimates (No Bootstrap) Skip bootstrap for quick exploration: ```{r no-bootstrap, eval=FALSE} # Point estimates only (no confidence intervals) result <- causal.estimate(S, Z, X, Y, B = 0, method = "IC", naturalGroupProp) summary(result) # Shows point estimates without CIs ``` ### Parallel Bootstrap Speed up computation using multiple cores: ```{r parallel, eval=FALSE} # Install required packages first install.packages(c("future", "future.apply")) # Define n_cores based on your system (e.g., 4) result <- causal.estimate( S, Z, X, Y, B = 100, method = "IC", naturalGroupProp = naturalGroupProp, parallel = TRUE, n_cores = 4 ) ``` The parallelization strategy depends on the environment. On Mac/Linux, `multicore` is used: workers are created by duplicating the current R process, so data is already in memory and no copying is required. On Windows, `multisession` is used: each worker is a fresh R process that receives a copy of the data, which has higher startup overhead. RStudio disables `multicore` regardless of platform, so `multisession` will be used when running inside RStudio. For best performance on Mac/Linux, run your script from the terminal using `Rscript` or an interactive R session. # Diagnostic Warnings WMAP automatically checks for common problems and warns you: ## Low ESS Warning ```{r low-ess-demo, eval=FALSE} # Example warning: # Warning message: # Low effective sample size: 3.1% (threshold: 5%). ``` ## Extreme Weights Warning ```{r extreme-weights-demo, eval=FALSE} # Example warning: # Warning message: # Extreme weights: max/mean ratio = 45.2 (threshold: 20). ``` ## Positivity Violations ```{r positivity-demo, eval=FALSE} # Example warning: # Warning message: # Sparse study-group cells: 2 cell(s) with < 5 observations. # Example warning: # Warning message: # Near-zero propensity scores: 6.3% of subjects truncated. ``` # Common Workflows ## Standard Analysis Workflow ```{r standard-workflow, eval=FALSE} # 1. Load and check data data(demo) table(S, Z) # Check all cells have observations sum(is.na(cbind(S, Z, X, Y))) # Check for missing values # 2. Run analysis result <- causal.estimate( S, Z, X, Y, B = 100, method = "FLEXOR", naturalGroupProp = naturalGroupProp, seed = 123 ) # 3. Check diagnostics summary(result) plot(result) ``` ## Method Comparison Workflow ```{r comparison-workflow, eval=FALSE} # Compare all three methods set.seed(123) results_ic <- causal.estimate(S, Z, X, Y, B = 100, method = "IC", naturalGroupProp) results_igo <- causal.estimate(S, Z, X, Y, B = 100, method = "IGO", naturalGroupProp) results_flexor <- causal.estimate(S, Z, X, Y, B = 100, method = "FLEXOR", naturalGroupProp) # Compare ESS cat("IC ESS:", results_ic$percentESS, "%\n") cat("IGO ESS:", results_igo$percentESS, "%\n") cat("FLEXOR ESS:", results_flexor$percentESS, "%\n") # Compare estimates summary(results_ic) summary(results_igo) summary(results_flexor) ``` ## Sensitivity Analysis Workflow ```{r sensitivity-workflow, eval=FALSE} # Test sensitivity to bootstrap sample size result_b50 <- causal.estimate(S, Z, X, Y, B = 50, method = "IGO", naturalGroupProp, seed = 123) result_b100 <- causal.estimate(S, Z, X, Y, B = 100, method = "IGO", naturalGroupProp, seed = 123) result_b200 <- causal.estimate(S, Z, X, Y, B = 200, method = "IGO", naturalGroupProp, seed = 123) # Compare CI widths summary(result_b50) summary(result_b100) summary(result_b200) ``` ## Working with Weights Only Extract balancing weights without running full causal analysis: ```{r weights-workflow, eval=FALSE} # Get weights weights <- balancing.weights( S, Z, X, method = "IC", naturalGroupProp = naturalGroupProp ) # Examine weight distribution summary(weights) # Access weight vector for custom analysis head(weights$wt.v) ``` # Best Practices **Always set a seed for reproducibility:** ```{r reproducibility, eval=FALSE} result1 <- causal.estimate(S, Z, X, Y, B = 100, method = "IC", naturalGroupProp, seed = 123) result2 <- causal.estimate(S, Z, X, Y, B = 100, method = "IC", naturalGroupProp, seed = 123) ``` **Check data quality before analysis:** ```{r data-quality, eval=FALSE} # Check for missing values sum(is.na(cbind(S, Z, X, Y))) # Ensure all study-group combinations have observations table(S, Z) ``` # Getting Help ```{r help, eval=FALSE} # Function documentation ?causal.estimate ?balancing.weights # Package citation citation("WMAP") ```