--- title: "Subgroup Analysis with Multivariate Binary Outcomes" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Subgroup Analysis with Multivariate Binary Outcomes} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 7, fig.height = 5 ) ``` ```{r, echo=FALSE} biblio <- RefManageR::ReadBib("../inst/REFERENCES.bib", check = "error") ``` ## Introduction In many studies, treatment effects may depend on patient characteristics (e.g., age, disease severity). Ignoring this patient information can, among others, lead to: - **Masked treatment effects**: Specific subgroup effects cancel each other out - **Reduced precision**: Unexplained variability - **Limited generalizability**: Results only apply to the specific sample The `bglm()` function extends `bmvb()` by allowing for subgroup analysis while using full sample information through multinomial logistic regression. ## When to Use Subgroup Analysis Use `bglm()` when: - You want to estimate effects for specific subpopulations (e.g., elderly patients, severe cases) - You want to improve precision by explaining outcome variability - Treatment effects may vary by covariate level (interaction effects) Use `bmvb()` when: - No covariates are available or theoretically relevant ## Example: Clinical Trial with Age Effects We'll simulate a trial where treatment effectiveness varies by patient age. ### Generate Data ```{r setup, eval=TRUE} library(bmco) set.seed(2024) n <- 150 clinical_data <- data.frame( treatment = rep(c("standard", "new"), each = n/2), age = rnorm(n, mean = 55, sd = 12) ) # Treatment is more effective in younger patients clinical_data$pain_relief <- NA clinical_data$mobility_improved <- NA for (i in 1:nrow(clinical_data)) { is_new <- ifelse(clinical_data$treatment[i] == "new", 1, 0) age_centered <- (clinical_data$age[i] - 55) / 10 # Younger patients benefit more from new treatment (interaction effect) p_pain <- plogis(-0.5 + 0.8 * is_new - 0.3 * age_centered - 0.4 * is_new * age_centered) p_mobility <- plogis(-0.3 + 0.6 * is_new - 0.2 * age_centered - 0.3 * is_new * age_centered) clinical_data$pain_relief[i] <- rbinom(1, 1, p_pain) clinical_data$mobility_improved[i] <- rbinom(1, 1, p_mobility) } # View data head(clinical_data) summary(clinical_data$age) table(clinical_data$treatment) ``` ### Full sample analysis First, analyze treatment effect on full sample: ```{r full_sample, eval=TRUE} result_full_sample <- bglm( data = clinical_data, grp = "treatment", grp_a = "standard", grp_b = "new", x_var = "age", y_vars = c("pain_relief", "mobility_improved"), x_method = "Empirical", x_def = c(-Inf, Inf), # Full sample test = "right_sided", rule = "All", n_burn = 200, n_it = 500 ) print(result_full_sample) summary(result_full_sample) ``` This gives an average effect across all ages in the sample. ### Subgroup Analysis Now estimate the treatment difference for people aged 60+: ```{r adjusted_analysis, results='hide', message=FALSE, eval=TRUE} result_subgroup <- bglm( data = clinical_data, grp = "treatment", grp_a = "standard", grp_b = "new", x_var = "age", y_vars = c("pain_relief", "mobility_improved"), x_method = "Empirical", x_def = c(60, Inf), test = "right_sided", rule = "All", n_burn = 200, n_it = 500 ) ``` ```{r show_adjusted, eval=TRUE} print(result_subgroup) ``` This analysis shows the treatment difference for a part of the age distribution in the sample. ## Three Methods for Population Definition `bglm()` offers three ways to define the population of interest: ### 1. Value Method: Specific Covariate Level Estimate effect at a **single covariate value** (e.g., age = 60): ```{r value_method, results='hide', message=FALSE, eval=TRUE} result_age60 <- bglm( data = clinical_data, grp = "treatment", grp_a = "standard", grp_b = "new", x_var = "age", y_vars = c("pain_relief", "mobility_improved"), x_method = "Value", x_def = 60, # Effect for 60-year-olds test = "right_sided", rule = "All", n_burn = 200, n_it = 500 ) ``` ```{r show_value, eval=TRUE} cat("Effect for 60-year-olds:\n") cat(" Mean effect:", result_age60$delta$mean_delta, "\n") cat(" Posterior probability:", result_age60$delta$pop, "\n") ``` ### 2. Empirical Method: Observed Covariate Range Average effect over **observed covariate values** in a specified range: ```{r empirical_method, results='hide', message=FALSE, eval=TRUE} result_young <- bglm( data = clinical_data, grp = "treatment", grp_a = "standard", grp_b = "new", x_var = "age", y_vars = c("pain_relief", "mobility_improved"), x_method = "Empirical", x_def = c(-Inf, 55), # Younger patients (age <= 55) test = "right_sided", rule = "All", n_burn = 200, n_it = 500 ) result_old <- bglm( data = clinical_data, grp = "treatment", grp_a = "standard", grp_b = "new", x_var = "age", y_vars = c("pain_relief", "mobility_improved"), x_method = "Empirical", x_def = c(55, Inf), # Older patients (age > 55) test = "right_sided", rule = "All", n_burn = 200, n_it = 500 ) result_all_ages <- bglm( data = clinical_data, grp = "treatment", grp_a = "standard", grp_b = "new", x_var = "age", y_vars = c("pain_relief", "mobility_improved"), x_method = "Empirical", x_def = c(-Inf, Inf), # Older patients (age > 55) test = "right_sided", rule = "All", n_burn = 200, n_it = 500 ) ``` ```{r show_empirical, eval=TRUE} cat("Effect in younger patients (age <= 55):\n") cat(" Mean effect:", result_young$delta$mean_delta, "\n") cat(" Posterior probability:", result_young$delta$pop, "\n\n") cat("Effect in older patients (age > 55):\n") cat(" Mean effect:", result_old$delta$mean_delta, "\n") cat(" Posterior probability:", result_old$delta$pop, "\n") cat("Effect in all patients:\n") cat(" Mean effect:", result_all_ages$delta$mean_delta, "\n") cat(" Posterior probability:", result_all_ages$delta$pop, "\n\n") ``` The empirical method uses the **actual distribution** of ages in the data within each range. ### 3. Analytical Method: Theoretical Covariate Distribution Average effect assuming covariate follows a **normal distribution**: ```{r analytical_method, results='hide', message=FALSE, eval=TRUE} result_analytical <- bglm( data = clinical_data, grp = "treatment", grp_a = "standard", grp_b = "new", x_var = "age", y_vars = c("pain_relief", "mobility_improved"), x_method = "Analytical", x_def = c(40, 70), # Age range 40-70 test = "right_sided", rule = "All", n_burn = 200, n_it = 500 ) ``` ```{r show_analytical, eval=TRUE} cat("Effect for age 40-70 (analytical integration):\n") cat(" Mean effect:", result_analytical$delta$mean_delta, "\n") cat(" Posterior probability:", result_analytical$delta$pop, "\n") ``` The analytical method integrates over a **truncated normal distribution** fitted to the data. ### Choosing a Method | Method | When to use | Advantages | Disadvantages | |--------|-------------|------------|---------------| | **Value** | Specific subgroup (e.g., typical patient) | Simple, interpretable | Single point | | **Empirical** | Generalize to observed range | Uses actual data distribution | Limited to observed values, not suitable for small subgroups (low *n*) | | **Analytical** | Smooth extrapolation | Can extrapolate beyond data, no minimum subgroup size | Assumes normality | ## Comparing Results Across Methods ```{r compare_methods,eval=TRUE} cat("Mean treatment difference by method:\n") cat(" Value (age=60): ", result_age60$delta$mean_delta, "\n") cat(" Empirical (young): ", result_young$delta$mean_delta, "\n") cat(" Empirical (old): ", result_old$delta$mean_delta, "\n") cat(" Empirical (all): ", result_all_ages$delta$mean_delta, "\n") cat(" Analytical (40-70): ", result_analytical$delta$mean_delta, "\n") cat("Posterior Probabilities by method:\n") cat(" Value (age=60): ", result_age60$delta$pop, "\n") cat(" Empirical (young): ", result_young$delta$pop, "\n") cat(" Empirical (old): ", result_old$delta$pop, "\n") cat(" Empirical (all): ", result_all_ages$delta$pop, "\n") cat(" Analytical (40-70): ", result_analytical$delta$pop, "\n") ``` The differences reflect: 1. **Heterogeneous treatment effects**: More evidence in favor of the new treatment in younger patients 2. **Different target populations**: Each method answers a slightly different question 3. **Sample composition**: Empirical methods reflect the actual age distribution ## Understanding Regression Coefficients The model includes: - **Intercept**: Baseline log-odds for reference group (standard treatment, mean age) - **Group effect**: Main effect of treatment - **Age effect**: Effect of age (centered) - **Interaction**: How treatment effect varies with age ```{r show_coefficients, eval=TRUE} # Regression parameter estimates result_all_ages$estimates$b ``` These multinomial regression coefficients have no straightforward interpretation in terms of treatment effects. Therefore, (posterior samples of) regression coefficients are transformed to (posterior samples of) marginal success probabilities ($\mathbf{\theta}_{A}$, $\mathbf{\theta}_{B}$) and marginal probability differences ($\mathbf{\delta} = \mathbf{\theta}_{B} - \mathbf{\theta}_{A}$) to be used for further analysis and decision-making via the three abovementioned methods. ## Decision Rules All three decision rules work: ### All Rule ```{r rule_all, results='hide', message=FALSE,eval=TRUE} result_all <- bglm( data = clinical_data, grp = "treatment", grp_a = "standard", grp_b = "new", x_var = "age", y_vars = c("pain_relief", "mobility_improved"), x_method = "Value", x_def = 55, rule = "All", # Both outcomes must improve n_burn = 200, n_it = 500 ) ``` ```{r show_all, eval=TRUE} cat("Mean delta:", result_all$delta$mean_delta, "\n") cat("P(new treatment better on BOTH outcomes | age=55):", result_all$delta$pop, "\n") ``` ### Any Rule ```{r rule_any, results='hide', message=FALSE, eval=TRUE} result_any <- bglm( data = clinical_data, grp = "treatment", grp_a = "standard", grp_b = "new", x_var = "age", y_vars = c("pain_relief", "mobility_improved"), x_method = "Value", x_def = 55, rule = "Any", # At least one outcome must improve n_burn = 200, n_it = 500 ) ``` ```{r show_any, eval=TRUE} cat("Mean delta:", result_any$delta$mean_delta, "\n") cat("P(new treatment better on AT LEAST ONE outcome | age=55):", result_any$delta$pop, "\n") ``` ### Compensatory Rule ```{r rule_comp, results='hide', message=FALSE, eval=TRUE} result_comp <- bglm( data = clinical_data, grp = "treatment", grp_a = "standard", grp_b = "new", x_var = "age", y_vars = c("pain_relief", "mobility_improved"), x_method = "Value", x_def = 55, rule = "Comp", w = c(0.6, 0.4), # Pain relief weighted more heavily n_burn = 200, n_it = 500 ) ``` ```{r show_comp, eval=TRUE} cat("Mean delta:", result_comp$delta$mean_delta, "\n") cat("P(weighted improvement | age=55):", result_comp$delta$pop, "\n") cat("Weighted effect:", result_comp$delta$w_delta, "\n") ``` ## Practical Example: Subgroup Analysis Analyze treatment effects across age quartiles: ```{r subgroup_analysis, eval=TRUE} age_quartiles <- quantile(clinical_data$age, probs = c(0, 0.25, 0.5, 0.75, 1)) subgroup_results <- data.frame( age_group = c("Q1 (youngest)", "Q2", "Q3", "Q4 (oldest)"), age_range = paste0( round(age_quartiles[1:4]), "-", round(age_quartiles[2:5]) ), mean_delta1 = NA, mean_delta2 = NA, post_prob = NA ) for (i in 1:4) { result_q <- bglm( data = clinical_data, grp = "treatment", grp_a = "standard", grp_b = "new", x_var = "age", y_vars = c("pain_relief", "mobility_improved"), x_method = "Empirical", x_def = c(age_quartiles[i], age_quartiles[i + 1]), rule = "All", n_burn = 1000, n_it = 3000 ) subgroup_results$mean_delta1[i] <- result_q$delta$mean_delta[1] subgroup_results$mean_delta2[i] <- result_q$delta$mean_delta[2] subgroup_results$post_prob[i] <- result_q$delta$pop } print(subgroup_results) ``` This reveals how treatment effectiveness varies across age groups. ## Discrete Covariates Covariates can also be categorical (automatically detected): ```{r discrete_covariate, eval=TRUE} # Add disease severity clinical_data$severity <- factor( sample(c("mild", "severe"), nrow(clinical_data), replace = TRUE) ) result_discrete <- bglm( data = clinical_data, grp = "treatment", grp_a = "standard", grp_b = "new", x_var = "severity", y_vars = c("pain_relief", "mobility_improved"), x_method = "Value", # For discrete covariates, use Value method x_def = 0, rule = "All", n_burn = 200, n_it = 500 ) print(result_discrete) ``` ## Sample Size Considerations Subgroup analysis typically requires larger samples than `bmvb()`, especially when the age distribution in the sample is used (method ``Empirical``: Too small samples may lead to: - Unstable coefficient estimates - Wide credible intervals - Poor convergence For more information on sample size computations, see @Kavelaars2024. ## Specifying Prior Distributions Regression coefficients have a **multivariate normal prior**: $$\beta \sim N(\mu_0, \Sigma_0)$$ ### Default Priors ```{r default_priors, eval=TRUE} # Default for bglm/bglmm p <- 4 b_mu0 <- rep(0, p) # Zero prior mean b_sigma0 <- solve(diag(100, p)) # Small prior precision (large prior variance). Weakly informative. # solve() generates precision matrix from variance matrix ``` Where `p` is the number of fixed effects: - Intercept - Group effect - Covariate effect(s) - Interaction term(s) #### Custom Fixed Effects Priors **Example 1**: Informative prior favoring positive treatment effect ```{r custom_prior_1, results='hide', message=FALSE, warning=FALSE, eval=TRUE} # Assume treatment improves outcomes (positive coefficient on group) # For binary covariate with 4 parameters: Intercept, grp, x, grp:x custom_b_mu0 <- c(0, 0.5, 0, 0) # Expect positive group effect custom_b_sigma0 <- solve(diag(c(10, 5, 10, 10))) # More certainty on group effect result_informative <- bglm( data = clinical_data, grp = "treatment", grp_a = "standard", grp_b = "new", x_var = "age", y_vars = c("pain_relief", "mobility_improved"), x_method = "Value", x_def = 55, b_mu0 = custom_b_mu0, b_sigma0 = custom_b_sigma0, n_burn = 200, n_it = 500 ) ``` **Example 2**: Skeptical prior (no effect expected) ```{r custom_prior_2, results='hide', message=FALSE, warning=FALSE, eval=TRUE} # Skeptical about treatment effect - tight prior around zero skeptical_b_mu0 <- c(0, 0, 0, 0) # No effect expected skeptical_b_sigma0 <- solve(diag(c(1, 1, 1, 1))) # Small variance result_skeptical <- bglm( data = clinical_data, grp = "treatment", grp_a = "standard", grp_b = "new", x_var = "age", y_vars = c("pain_relief", "mobility_improved"), x_method = "Value", x_def = 55, b_mu0 = skeptical_b_mu0, b_sigma0 = skeptical_b_sigma0, n_burn = 200, n_it = 500 ) ``` ### Prior Sensitivity Analysis Compare results under different priors: ```{r prior_sensitivity, results='hide', message=FALSE, warning=FALSE, eval=TRUE} # Weakly informative (default) result_weak <- bglm( data = clinical_data, grp = "treatment", grp_a = "standard", grp_b = "new", x_var = "age", y_vars = c("pain_relief", "mobility_improved"), x_method = "Value", x_def = 55, b_mu0 = c(0, 0, 0, 0), b_sigma0 = solve(diag(10, 4)), test = "right_sided", rule = "All", n_burn = 200, n_it = 500 ) # More informative result_more_informative <- bglm( data = clinical_data, grp = "treatment", grp_a = "standard", grp_b = "new", x_var = "age", y_vars = c("pain_relief", "mobility_improved"), x_method = "Value", x_def = 55, b_mu0 = c(0, 0.3, 0, 0.5), b_sigma0 = solve(diag(c(1, 5, 1, 1))), test = "right_sided", rule = "All", n_burn = 200, n_it = 500 ) ``` ```{r compare_priors, eval=TRUE} cat("Prior Sensitivity:\n") cat(" Weakly informative:\n") cat(" Mean effect:", result_weak$delta$mean_delta, "\n\n") cat(" Posterior prob:", result_weak$delta$pop, "\n") cat(" More informative (favoring treatment):\n") cat(" Mean effect:", result_more_informative$delta$mean_delta, "\n") cat(" Posterior prob:", result_more_informative$delta$pop, "\n") cat(" Informative (favoring treatment):\n") cat(" Mean effect:", result_informative$delta$mean_delta, "\n") cat(" Posterior prob:", result_informative$delta$pop, "\n") cat(" Skeptical:\n") cat(" Mean effect:", result_skeptical$delta$mean_delta, "\n") cat(" Posterior prob:", result_skeptical$delta$pop, "\n") ``` **Interpretation**: - If results are **similar** → data dominate, prior has little influence - If results **differ substantially** → prior is influential ### Further Reading For Bayesian logistic regression priors: - Gelman, A. Jakulin, A., Pittau, M. G. & Su, Y-S (2008). A weakly informative default prior distribution for logistic and other regression models. *Annals of Applied Statistics, 2*(4), 1360-1383. \doi{10.1214/08-AOAS191}. ## MCMC Diagnostics The function internally checks the multivariate potential scale reduction factor (MPSRF) and warns if > 1.1. Additional MCMC diagnostics will be returned When `return_diagnostics = TRUE`. When `return_diagnostic_plots = TRUE`, `bglm()` returns diagnostic plots as well. ```{r convergence, eval = FALSE} result <- bglm( # ... arguments ... n_burn = 200, # Increase if convergence issues n_it = 500, return_diagnostics = TRUE, return_diagnostic_plots = TRUE ) ``` Signs of poor convergence: - Warning: "MCMC chains may not have converged" - Very wide standard errors - Unexpected posterior probabilities Solutions: 1. Increase `n_burn` (e.g., 5000) 2. Increase `n_it` (e.g., 10000) 3. Check for data issues (perfect separation, outliers) ## Comparison: `bmvb()` and `bglm()` ```{r comparison, eval=TRUE} # Multivariate Bernoulli-distribution: result_bmvb <- bmvb( data = clinical_data, grp = "treatment", grp_a = "standard", grp_b = "new", y_vars = c("pain_relief", "mobility_improved"), n_it = 500 ) # Logistic regression, full range: result_bglm <- bglm( data = clinical_data, grp = "treatment", grp_a = "standard", grp_b = "new", x_var = "age", y_vars = c("pain_relief", "mobility_improved"), x_method = "Empirical", x_def = c(-Inf, Inf), n_burn = 200, n_it = 500 ) cat("Comparison:\n") cat(" Multivariate Bernoulli:\n") cat(" Mean treatment difference: ", result_bmvb$delta$mean_delta, "\n") cat(" Posterior probability: ", result_bmvb$delta$pop, "\n") cat(" Logistic regression:\n ") cat(" Mean treatment difference: ", result_bglm$delta$mean_delta, "\n") cat(" Posterior probability: ", result_bglm$delta$pop, "\n") cat(" Difference:\n ") cat(" Treatment difference:", abs(result_bglm$delta$mean_delta - result_bmvb$delta$mean_delta), "\n") cat(" Posterior probability:", abs(result_bglm$delta$pop - result_bmvb$delta$pop), "\n") ``` When the full sample range of the covariate is used, results should be similar. ## Advanced: Extracting Predictions ```{r predictions, results='hide', message=FALSE, eval=TRUE} result_samples <- bglm( data = clinical_data, grp = "treatment", grp_a = "standard", grp_b = "new", x_var = "age", y_vars = c("pain_relief", "mobility_improved"), x_method = "Value", x_def = 65, n_burn = 200, n_it = 500, return_samples = TRUE ) ``` ```{r use_predictions, eval=TRUE} # Treatment effect samples for age = 65 delta_samples <- result_samples$samples$delta # Custom probability: effect on pain relief > 0.2 cat("P(effect on pain > 0.2 | age=65):", mean(delta_samples[, 1] > 0.2), "\n") # Credible interval ci <- apply(delta_samples, 2, quantile, probs = c(0.025, 0.975)) cat("\n95% Credible Intervals for age=65:\n") print(ci) ``` ## Summary `bglm()` extends Bayesian multivariate analysis by: 1. **Estimating subgroup effects** (Value, Empirical, Analytical methods) 2. **Handling interactions** between treatment and covariates 3. **Maintaining multivariate inference** (All, Any, Compensatory rules) Choose your analysis: - **`bmvb()`**: No covariates → Simple comparison - **`bglm()`**: Covariates → Subgroup comparison - **`bglmm()`**: Covariates and/or clustering → Full hierarchical model ## References For more details on statistical methodology or when using the `bglm()` function, please cite: `r capture.output(print(biblio["Kavelaars2024"], .opts = list(style = "markdown", bib.style = "authoryear", sorting = "ydnt")))`