---
title: "Multivariate Regression"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Multivariate Regression}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
## Use ragg for better font rendering if available
if (requireNamespace("ragg", quietly = TRUE)) {
  old_opts <- options(summata.use_ragg = TRUE, width = 180)
  knitr::opts_chunk$set(
    dev = "ragg_png",
    fig.retina = 1,
    collapse = TRUE,
    comment = "##>",
    message = FALSE,
    warning = FALSE,
    fig.width = 8,
    fig.height = 5,
    out.width = "80%"
  )
} else {
  old_opts <- options(width = 180)
  knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "##>",
    message = FALSE,
    warning = FALSE,
    fig.width = 8,
    fig.height = 5,
    out.width = "80%"
  )
}

## Dynamic figure sizing: queue_plot() stashes rec_dims from a plot object,
## and the opts_hook on the NEXT chunk (with use_rec_dims = TRUE) applies them
## before knitr opens the graphics device. Plots render via ragg (dev = "ragg_png"
## set above) and knitr captures them natively. No files written to disk.
.plot_dims <- new.env(parent = emptyenv())
.plot_dims$width <- NULL
.plot_dims$height <- NULL

knitr::opts_hooks$set(use_rec_dims = function(options) {
  if (isTRUE(options$use_rec_dims)) {
    if (!is.null(.plot_dims$width))  options$fig.width  <- .plot_dims$width
    if (!is.null(.plot_dims$height)) options$fig.height <- .plot_dims$height
    .plot_dims$width <- NULL
    .plot_dims$height <- NULL
  }
  options
})

## Call at the end of a plot-creation chunk to stash dimensions for the next chunk.
queue_plot <- function(plot) {
  dims <- attr(plot, "rec_dims")
  if (!is.null(dims)) {
    .plot_dims$width  <- dims$width
    .plot_dims$height <- dims$height
  }
  invisible(plot)
}
```

Multivariate regression refers to the simultaneous examination of a single independent predictor across multiple dependent variables. This approach inverts the typical regression paradigm: rather than testing multiple predictors against one outcome (as in `uniscreen()`, `fit()`, and `fullfit()`), multivariate regression tests the predictive value of a individual factor against multiple outcomes, with or without multivariable adjustment for other independent factors. This design is particularly useful when a key exposure or intervention must be evaluated against several endpoints simultaneously.

The `multifit()` function implements this workflow, supporting all model types available in `summata` with optional covariate adjustment, interaction terms, and mixed-effects specifications. Results can be visualized using `multiforest()` and exported using standard table export functions.

As with other `summata` functions, this function adheres to the standard calling convention:

```{r, eval = FALSE}
multifit(data, outcomes, predictor, covariates, model_type, ...)
```

where `data` is the dataset, `outcomes` is a vector of screened outcome variables, `predictor` is the predicting variable under examination, `covariates` is a vector of simultaneous independent predicting variables (i.e., potential confounders), and `model_type` is the type of model to be implemented. This vignette demonstrates the various capabilities of this function using the included sample dataset.

---

# Preliminaries

The examples in this vignette use the `clintrial` dataset included with `summata`:

```{r setup}
library(summata)
library(survival)
library(ggplot2)

data(clintrial)
data(clintrial_labels)
```

The `clintrial` dataset includes multiple outcome types suitable for multivariate analysis:
 
| Outcome Type | Variables | Model |
|:-------------|:----------|:------|
| Continuous | `los_days`, `pain_score`, `recovery_days` | `lm` |
| Binary | `any_complication`, `wound_infection`, `readmission_30d`, `icu_admission` | `glm` (binomial) |
| Count (equidispersed) | `fu_count` | `glm` (poisson) |
| Count (overdispersed) | `ae_count` | `negbin` or `quasipoisson` |
| Time-to-event | `Surv(pfs_months, pfs_status)`, `Surv(os_months, os_status)` | `coxph` |

> *n.b.:* To ensure correct font rendering and figure sizing, the forest plots below are displayed using a helper function (`queue_plot()`) that applies each plot's recommended dimensions (stored in the `"rec_dims"` attribute) via the [`ragg`](https://ragg.r-lib.org/) graphics device. In practice, replace `queue_plot()` with `ggplot2::ggsave()` using recommended plot dimensions for equivalent results:
> 
> ```{r, eval = FALSE}
> p <- glmforest(model, data = mydata)
> dims <- attr(p, "rec_dims")
> ggplot2::ggsave("forest_plot.png", p,
>                 width = dims$width, 
>                 height = dims$height)
> ```
> 
> This ensures that the figure size is always large enough to accommodate the constituent plot text and graphics, and it is generally the preferred method for saving forest plot outputs in `summata`.

---

# Basic Usage

## **Example 1:** Unadjusted Analysis

Test a single predictor against multiple binary outcomes:

```{r}
example1 <- multifit(
  data = clintrial,
  outcomes = c("any_complication", "wound_infection",
               "readmission_30d", "icu_admission"),
  predictor = "surgery",
  labels = clintrial_labels,
  parallel = FALSE
)

example1
```

Each row shows the effect of the predictor on a specific outcome. For categorical predictors, each non-reference level appears separately.

## **Example 2:** Adjusted Analysis

Include covariates for confounding control:

```{r}
example2 <- multifit(
  data = clintrial,
  outcomes = c("any_complication", "wound_infection",
               "readmission_30d", "icu_admission"),
  predictor = "surgery",
  covariates = c("age", "sex", "smoking", "diabetes"),
  labels = clintrial_labels,
  parallel = FALSE
)

example2
```

## **Example 3:** Unadjusted and Adjusted Comparison

Use `columns = "both"` to display unadjusted and adjusted results side-by-side:

```{r}
example3 <- multifit(
  data = clintrial,
  outcomes = c("any_complication", "wound_infection",
               "readmission_30d", "icu_admission"),
  predictor = "surgery",
  covariates = c("age", "sex", "diabetes", "surgery"),
  columns = "both",
  labels = clintrial_labels,
  parallel = FALSE
)

example3
```

Comparing columns reveals confounding (large differences) or robust associations (similar estimates).

---

# Predictor Types

## **Example 4:** Continuous Predictors

For continuous predictors, one row appears per outcome:

```{r}
example4 <- multifit(
  data = clintrial,
  outcomes = c("any_complication", "wound_infection", "icu_admission"),
  predictor = "age",
  covariates = c("sex", "treatment", "surgery"),
  labels = clintrial_labels,
  parallel = FALSE
)

example4
```

The effect estimate represents the change in log-odds per one-unit increase in the predictor.

## **Example 5:** Multilevel Categorical Predictors

For categorical predictors with multiple levels, the display is expanded:

```{r}
example5 <- multifit(
  data = clintrial,
  outcomes = c("any_complication", "wound_infection",
               "readmission_30d", "icu_admission"),
  predictor = "treatment",
  covariates = c("age", "sex", "surgery"),
  labels = clintrial_labels,
  parallel = FALSE
)

example5
```

---

# Model Types

## **Example 6:** Cox Regression

For time-to-event outcomes, use `model_type = "coxph"`:

```{r}
example6 <- multifit(
  data = clintrial,
  outcomes = c("Surv(pfs_months, pfs_status)",
               "Surv(os_months, os_status)"),
  predictor = "treatment",
  covariates = c("age", "sex", "stage"),
  model_type = "coxph",
  labels = clintrial_labels,
  parallel = FALSE
)

example6
```

## **Example 7:** Linear Regression

For continuous outcomes, use `model_type = "lm"`:

```{r}
example7 <- multifit(
  data = clintrial,
  outcomes = c("los_days", "pain_score", "recovery_days"),
  predictor = "treatment",
  covariates = c("age", "sex", "surgery"),
  model_type = "lm",
  labels = clintrial_labels,
  parallel = FALSE
)

example7
```

## **Example 8:** Mixed-Effects Models

Account for clustering using random effects:

```{r}
example8 <- multifit(
  data = clintrial,
  outcomes = c("any_complication", "wound_infection"),
  predictor = "treatment",
  covariates = c("age", "sex"),
  random = "(1|site)",
  model_type = "glmer",
  labels = clintrial_labels,
  parallel = FALSE
)

example8
```

---

# Advanced Features

## **Example 9:** Interaction Terms

Interaction terms can be added to test effect modification:

```{r}
example9 <- multifit(
  data = clintrial,
  outcomes = c("any_complication", "wound_infection"),
  predictor = "treatment",
  covariates = c("age", "sex"),
  interactions = c("treatment:sex"),
  labels = clintrial_labels,
  parallel = FALSE
)

example9
```

## **Example 10:** Filtering by p-value

Outputs can be filtered to retain only significant associations:

```{r}
example10 <- multifit(
  data = clintrial,
  outcomes = c("any_complication", "wound_infection",
               "readmission_30d", "icu_admission"),
  predictor = "treatment",
  covariates = c("age", "sex", "surgery"),
  p_threshold = 0.01,
  labels = clintrial_labels,
  parallel = FALSE
)

example10
```

## **Example 11:** Accessing Model Objects

Underlying model objects are stored as attributes for additional analysis:

```{r}
result <- multifit(
  data = clintrial,
  outcomes = c("any_complication", "wound_infection"),
  predictor = "treatment",
  covariates = c("age", "sex"),
  labels = clintrial_labels,
  keep_models = TRUE,
  parallel = FALSE
)

# Access individual models
models <- attr(result, "models")
names(models)

# Examine a specific model
summary(models[["any_complication"]])
```

---

# Forest Plot Visualization

The `multiforest()` function creates forest plots from `multifit()` results.

## **Example 12:** Forest Plot for Logistic (Binary) Outcomes

Results from `multifit()` can be directly inserted into the `multiforest()` function to generate a forest plot:

```{r}
result <- multifit(
  data = clintrial,
  outcomes = c("any_complication", "wound_infection",
               "readmission_30d", "icu_admission"),
  predictor = "treatment",
  covariates = c("age", "sex", "diabetes", "surgery"),
  labels = clintrial_labels,
  parallel = FALSE
)

example12 <- multiforest(
  result,
  title = "Treatment Effects Across Outcomes",
  indent_predictor = TRUE,
  zebra_stripes = TRUE
)
queue_plot(example12)
```

```{r, echo = FALSE, out.width = "100%", use_rec_dims = TRUE}
print(example12)
```

## **Example 13:** Customization Options

Customize the appearance of the forest plot by adjusting function parameters:

```{r}
example13 <- multiforest(
  result,
  title = "Effect Estimates",
  column = "adjusted",
  show_predictor = FALSE,
  covariates_footer = TRUE,
  table_width = 0.65,
  color = "#4BA6B6"
)
queue_plot(example13)
```

```{r, echo = FALSE, out.width = "100%", use_rec_dims = TRUE}
print(example13)
```

## **Example 14:** Forest Plot for Continuous Outcomes

Create forest plots for continuous outcomes:

```{r}
lm_result <- multifit(
  data = clintrial,
  outcomes = c("pain_score", "recovery_days", "los_days"),
  predictor = "treatment",
  covariates = c("age", "sex", "surgery"),
  model_type = "lm",
  parallel = FALSE
)

example14 <- multiforest(
  lm_result,
  title = "Treatment Effects on Recovery Metrics",
  show_predictor = FALSE,
  covariates_footer = TRUE,
  labels = clintrial_labels
)
queue_plot(example14)
```

```{r, echo = FALSE, out.width = "100%", use_rec_dims = TRUE}
print(example14)
```

## **Example 15:** Forest Plot for Survival Outcomes

The `multiforest()` function supports multiple `model_type` outputs from `multifit()`. In addition, variable labels can be applied to `multiforest()` outputs similarly to other forest plot functions:

```{r}
cox_result <- multifit(
  data = clintrial,
  outcomes = c("Surv(pfs_months, pfs_status)",
               "Surv(os_months, os_status)"),
  predictor = "treatment",
  covariates = c("age", "sex", "stage"),
  model_type = "coxph",
  parallel = FALSE
)

example15 <- multiforest(
  cox_result,
  title = "Treatment Effects on Survival Outcomes",
  indent_predictor = TRUE,
  zebra_stripes = TRUE,
  labels = clintrial_labels
)
queue_plot(example15)
```

```{r, echo = FALSE, out.width = "100%", use_rec_dims = TRUE}
print(example15)
```

---

# Exporting Results

## Tables

Export tables using standard export functions:

```{r, eval = FALSE}
table2docx(
  table = result,
  file = file.path(tempdir(), "multioutcome_analysis.docx"),
  caption = "Treatment Effects Across Outcomes"
)

table2pdf(
  table = result,
  file = file.path(tempdir(), "multioutcome_analysis.pdf"),
  caption = "Treatment Effects Across Outcomes"
)
```

## Forest Plots

Save forest plots using `ggsave()`:

```{r, eval = FALSE}
p <- multiforest(result, title = "Effect Estimates")
dims <- attr(p, "rec_dims")

ggsave(file.path(tempdir(), "multioutcome_forest.pdf"), p,
       width = attr(result, "rec_dims")$width,
       height = attr(result, "rec_dims")$height, 
       units = "in")
```

---

## Complete Workflow Example

The following demonstrates a complete multi-outcome analysis workflow:

```{r}
## Define outcomes by type
binary_outcomes <- c("any_complication", "wound_infection",
                     "readmission_30d", "icu_admission")
survival_outcomes <- c("Surv(pfs_months, pfs_status)",
                       "Surv(os_months, os_status)")

## Unadjusted screening
unadjusted <- multifit(
  data = clintrial,
  outcomes = binary_outcomes,
  predictor = "treatment",
  labels = clintrial_labels,
  parallel = FALSE
)

unadjusted

## Adjusted analysis with comparison
adjusted <- multifit(
  data = clintrial,
  outcomes = binary_outcomes,
  predictor = "treatment",
  covariates = c("age", "sex", "diabetes", "surgery"),
  columns = "both",
  labels = clintrial_labels,
  parallel = FALSE
)

adjusted

## Forest plot visualization
forest_plot <- multiforest(
  adjusted,
  title = "Treatment Effect Estimates",
  column = "adjusted",
  indent_predictor = TRUE,
  zebra_stripes = TRUE,
  table_width = 0.65,
  labels = clintrial_labels
)
queue_plot(forest_plot)
```

```{r, echo = FALSE, out.width = "100%", use_rec_dims = TRUE}
print(forest_plot)
```

---

## Parameter Reference

| Parameter | Description |
|:----------|:------------|
| `outcomes` | Character vector of outcome variable names |
| `predictor` | Single predictor variable name |
| `covariates` | Optional adjustment covariates |
| `interactions` | Interaction terms (colon notation) |
| `random` | Random effects formula for mixed models |
| `strata` | Stratification variable (Cox models) |
| `cluster` | Clustering variable for robust standard errors |
| `model_type` | `"glm"`, `"lm"`, `"coxph"`, `"glmer"`, `"lmer"`, `"coxme"` |
| `family` | GLM family (default: `"binomial"`) |
| `columns` | `"adjusted"`, `"unadjusted"`, or `"both"` |
| `p_threshold` | Filter results by *p*-value |
| `labels` | Custom variable labels |
| `parallel` | Enable parallel processing |

---

# Best Practices

## Outcome Selection

1. Use compatible outcome types within a single analysis
2. Group conceptually related outcomes
3. Consider multiple testing adjustments when testing many outcomes

## Adjustment Strategy

1. Pre-specify covariates based on domain knowledge
2. Use `columns = "both"` to assess confounding
3. Apply consistent covariates across outcomes for comparability

## Interpretation

1. Focus on effect magnitude and precision, not only *p*-values
2. Look for consistent patterns across related outcomes
3. Consider practical significance alongside statistical significance

---

# Common Issues

## Empty Results

If results are empty, verify:

- Predictor variable exists and has variation
- Outcomes are correctly specified (`Surv()` syntax for survival)
- Model type matches outcome type

## Convergence Warnings

For mixed-effects models, simplify the random effects structure:

```{r, eval = FALSE}
## Start with random intercepts only
multifit(data, outcomes, predictor,
         random = "(1|site)",
         model_type = "glmer")
```

## Many Factor Levels

For predictors with many levels, consider collapsing categories:

```{r, eval = FALSE}
clintrial$treatment_binary <- ifelse(clintrial$treatment == "Control", 
                                      "Control", "Active")
```

---

# Other Considerations

## Multivariate Regression vs. Univariable Screening

The distinction between `multifit()` and `uniscreen()` is important:

| Function | Tests | Use Case |
|:---------|:------|:---------|
| `uniscreen()` | Multiple predictors → One outcome | Variable screening, risk factor identification |
| `multifit()` | One predictor → Multiple outcomes | Exposure effects, intervention evaluation |

All outcomes in a single `multifit()` call should be of the same type (all binary, all continuous, or all survival). Mixing outcome types produces tables with incompatible effect measures. The function validates outcome compatibility and issues a warning when mixed types are detected.

## Categorical Outcomes with More Than Two Levels

`multifit()` supports binary outcomes (via logistic regression) but not multinomial or ordinal outcomes. If a categorical outcome with three or more levels is included (e.g., treatment group with "Control", "Drug A", "Drug B"), the function will issue a warning because binomial GLM coerces such variables to binary (first level *vs.* all others).

For multilevel categorical outcomes, use dedicated packages outside of `summata`:

```{r, eval = FALSE}
## Multinomial regression (unordered categories)
library(nnet)
model <- multinom(treatment ~ age + sex + stage, data = clintrial)

## Ordinal regression (ordered categories)
library(MASS)
model <- polr(grade ~ age + sex + stage, data = clintrial, Hess = TRUE)
```

```{r, include = FALSE}
options(old_opts)
```

---

## Further Reading

- [Descriptive Tables](descriptive_tables.html): `desctable()` for baseline characteristics
- [Regression Modeling](regression_modeling.html): `fit()`, `uniscreen()`, and `fullfit()`
- [Model Comparison](model_comparison.html): `compfit()` for comparing models
- [Table Export](table_export.html): Export to PDF, Word, and other formats
- [Forest Plots](forest_plots.html): Visualization of regression results
- [Advanced Workflows](advanced_workflows.html): Interactions and mixed-effects models