---
title: "Regression Modeling"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Regression Modeling}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  message = FALSE,
  warning = FALSE,
  fig.width = 12,
  fig.height = 6,
  dpi = 150
)

# Use ragg for better font rendering if available
if (requireNamespace("ragg", quietly = TRUE)) {
  knitr::opts_chunk$set(dev = "ragg_png")
}

old_opts <- options(width = 180)
```

Regression analysis quantifies the relationship between a response variable and one or more predictors while accounting for potential confounding. The standard analytical workflow proceeds in two phases: *univariable* (unadjusted) analysis, in which each predictor is examined independently; and *multivariable* (adjusted) analysis, in which effects conditional on other predictors are examined collectively. Comparing these estimates found in these analyses reveals the extent of confounding and identifies independent predictors of a given outcome.

The `summata` package provides three principal functions for regression analysis:

| Function | Purpose |
|:---------|:--------|
| `uniscreen()` | Univariable screening of multiple predictors |
| `fit()` | Single univariable/multivariable model |
| `fullfit()` | Combined univariable and multivariable analysis |

All functions support a wide range of model types including linear models, generalized linear models (logistic, Poisson, Gaussian, Gamma, negative binomial), Cox proportional hazards models, and mixed-effects models, with consistent syntax and formatted output.

As with other `summata` functions, these functions adhere to the standard calling convention:

```{r, eval = FALSE}
fullfit(data, outcome, predictors, model_type, ...)
```

where `data` is the dataset, `outcome` is the outcome variable, `predictors` is a vector of predicting variables, and `model_type` is the type of model to be implemented. This vignette demonstrates the various capabilities of these functions using the included sample dataset.

---

# Preliminaries

The examples in this vignette use the `clintrial` dataset included with `summata`:

```{r setup}
library(summata)
library(survival)

data(clintrial)
data(clintrial_labels)
```

The `clintrial` dataset simulates a multisite oncology clinical trial with realistic outcomes and site-level clustering. Variables suitable for different regression approaches include:

| Variable | Description | Type | Application |
|:---------|:------------|:-----|:------------|
| `pain_score` | Postoperative Pain Score (0-10) | Continuous | Linear regression |
| `readmission_30d` | 30-Day Readmission | Binary | Logistic regression |
| `any_complication` | Any Complication | Binary | Logistic regression |
| `los_days` | Length of Hospital Stay (days) | Continuous (positive) | Linear or Gamma regression |
| `recovery_days` | Days to Functional Recovery | Continuous (positive) | Linear or Gamma regression |
| `ae_count` | Adverse Event Count | Count (overdispersed) | Negative binomial / quasipoisson |
| `fu_count` | Follow-Up Visit Count | Count (equidispersed) | Poisson regression |
| `pfs_months`, `pfs_status` | Progression-Free Survival Time (months) | Time-to-event | Cox regression |
| `os_months`, `os_status` | Overall Survival Time (months) | Time-to-event | Cox regression |
| `site` | Study Site | Clustering variable | Mixed-effects models |

---

# Supported Model Types

The following table summarizes all supported model types (`model_type`) and families/links (`family`). For most estimates, effect measures are displayed by default; set `exponentiate = FALSE` to display raw coefficients.

| `model_type` | `family` | Outcome | Effect Measure | Package |
|:-------------|:------------|:--------|:---------------|:--------|
| `"lm"` | — | Continuous | *β* coefficient | `stats` |
| `"glm"` | `"binomial"` | Binary | Odds ratio | `stats` |
| `"glm"` | `"quasibinomial"` | Binary (overdispersed) | Odds ratio | `stats` |
| `"glm"` | `"poisson"` | Count | Rate ratio | `stats` |
| `"glm"` | `"quasipoisson"` | Count (overdispersed) | Rate ratio | `stats` |
| `"glm"` | `"gaussian"` | Continuous | *β* coefficient | `stats` |
| `"glm"` | `"Gamma"` | Positive continuous | Ratio (with log link) | `stats` |
| `"glm"` | `"inverse.gaussian"` | Positive continuous | Coefficient | `stats` |
| `"negbin"` | — | Count (overdispersed) | Rate ratio | `MASS` |
| `"coxph"` | — | Time-to-event | Hazard ratio | `survival` |
| `"clogit"` | — | Matched binary | Odds ratio | `survival` |
| `"lmer"` | — | Continuous (clustered) | *β* coefficient | `lme4` |
| `"glmer"` | `"binomial"` | Binary (clustered) | Odds ratio | `lme4` |
| `"glmer"` | `"poisson"` | Count (clustered) | Rate ratio | `lme4` |
| `"coxme"` | — | Time-to-event (clustered) | Hazard ratio | `coxme` |

---

# Univariable Screening

The `uniscreen()` function fits separate regression models for each predictor, combining results into a single table. This approach identifies candidate predictors for multivariable modeling.

## **Example 1:** Binary Outcome Screen (Logistic Regression)

For binary outcomes, use `model_type = "glm"` (logistic regression is the default family). Here we examine predictors of 30-day hospital readmission:

```{r}
screening_vars <- c("age", "sex", "race", "bmi", "smoking", 
                    "diabetes", "stage", "ecog", "treatment")

example1 <- uniscreen(
  data = clintrial,
  outcome = "readmission_30d",
  predictors = screening_vars,
  model_type = "glm",
  labels = clintrial_labels
)

example1
```

Each entry represents a separate univariable model. For categorical predictors, each non-reference level appears as a separate row.

## **Example 2:** Filtering Screen by *p*-value

The `p_threshold` parameter retains only predictors meeting a significance criterion:

```{r}
example2 <- uniscreen(
  data = clintrial,
  outcome = "readmission_30d",
  predictors = screening_vars,
  model_type = "glm",
  p_threshold = 0.01,
  labels = clintrial_labels
)

example2
```

## **Example 3:** Continuous Outcome Screen (Linear Regression)

For continuous outcomes, specify `model_type = "lm"`:

```{r}
example4 <- uniscreen(
  data = clintrial,
  outcome = "los_days",
  predictors = c("age", "sex", "stage", "diabetes", "ecog"),
  model_type = "lm",
  labels = clintrial_labels
)

example4
```

## **Example 4:** Time-to-Event Outcome Screen (Cox Regression)

For survival outcomes, specify the outcome using `Surv()` notation and set `model_type = "coxph"`:

```{r}
example3 <- uniscreen(
  data = clintrial,
  outcome = "Surv(os_months, os_status)",
  predictors = c("age", "sex", "treatment", "stage", "ecog"),
  model_type = "coxph",
  labels = clintrial_labels
)

example3
```

## **Example 5:** Retaining Model Objects

Setting `keep_models = TRUE` stores the fitted model objects for diagnostics:

```{r}
example5 <- uniscreen(
  data = clintrial,
  outcome = "readmission_30d",
  predictors = c("age", "sex", "stage"),
  model_type = "glm",
  keep_models = TRUE
)

# Access individual models
models <- attr(example5, "models")
names(models)

# Examine a specific model
summary(models[["age"]])
```

---

# Multivariable Modeling

The `fit()` function estimates a single regression model with multiple predictors, producing adjusted effect estimates. It is effectively a wrapper for existing model functions in R and other supported packages that enforces `summata` syntax for more consistent function calling.

## **Example 6:** Logistic Regression

For binary outcomes with multiple predictors:

```{r}
example6 <- fit(
  data = clintrial,
  outcome = "readmission_30d",
  predictors = c("age", "sex", "treatment", "stage", "diabetes"),
  model_type = "glm",
  labels = clintrial_labels
)

example6
```

## **Example 8:** Linear Regression

For continuous outcomes:

```{r}
example8 <- fit(
  data = clintrial,
  outcome = "los_days",
  predictors = c("age", "sex", "stage", "ecog"),
  model_type = "lm",
  labels = clintrial_labels
)

example8
```

## **Example 7:** Cox Regression

For time-to-event outcomes:

```{r}
example7 <- fit(
  data = clintrial,
  outcome = "Surv(os_months, os_status)",
  predictors = c("age", "sex", "treatment", "stage"),
  model_type = "coxph",
  labels = clintrial_labels
)

example7
```

## **Example 9:** Poisson Regression

For count outcomes where variance approximately equals the mean, use `model_type = "glm"` with `family = "poisson"`. The `fu_count` variable represents the number of follow-up clinic visits, which is equidispersed (suitable for standard Poisson):
  
```{r}
example9 <- fit(
  data = clintrial,
  outcome = "fu_count",
  predictors = c("age", "stage", "treatment", "surgery"),
  model_type = "glm",
  family = "poisson",
  labels = clintrial_labels
)

example9
```

Output displays rate ratios (RR). A rate ratio of 1.10 indicates a 10% higher event rate compared to the reference group. For overdispersed counts (variance > mean), consider negative binomial or quasipoisson regression instead (see Example 18).

## **Example 10:** Toggling Reference Rows for Factors

By default, reference rows are shown in output tables (`reference_rows = TRUE`). Setting this parameter to `FALSE` removes these reference rows:

```{r}
example10 <- fit(
  data = clintrial,
  outcome = "readmission_30d",
  predictors = c("sex", "stage", "treatment"),
  model_type = "glm",
  reference_rows = FALSE,
  labels = clintrial_labels
)

example10
```

## **Example 11:** Confidence Level

The `conf_level` parameter adjusts the confidence interval width:

```{r}
example11 <- fit(
  data = clintrial,
  outcome = "readmission_30d",
  predictors = c("age", "sex", "stage"),
  model_type = "glm",
  conf_level = 0.90
)

example11
```

## **Example 12:** Raw Coefficients

For logistic and Cox models, set `exponentiate = FALSE` to display log-scale coefficients (*β* rather than *e^β^*):

```{r}
example12 <- fit(
  data = clintrial,
  outcome = "readmission_30d",
  predictors = c("age", "sex", "stage"),
  model_type = "glm",
  exponentiate = FALSE
)

example12
```

---

# Combined Analysis

The `fullfit()` function integrates univariable screening and multivariable modeling into a single workflow, producing a combined table with both unadjusted and adjusted estimates.

This function extends `fit()` with an additional parameter (`method`) for variable selection:

```{r, eval = FALSE}
fullfit(data, outcome, predictors, model_type, method, ...)
```

For the `method` parameter, the following options are available. Setting `method` to one of the following options controls the predictors entering the multivariable model:

| Method | Description |
|:-------|:------------|
| `"screen"` | Only predictors meeting the *p*-threshold in univariable analysis are used in the multivariable model (default) |
| `"all"` | All predictors are used in both univariable and multivariable analyses |
| `"custom"` | Multivariable predictors are explicitly specified |

## **Example 13:** Screening-Based Selection

The default `method = "screen"` approach specifies that only predictors with univariable *p*-value below `p_threshold` are subsequently used in the multivariable analysis:

```{r}
example13 <- fullfit(
  data = clintrial,
  outcome = "readmission_30d",
  predictors = c("age", "sex", "bmi", "smoking", "diabetes",
                 "stage", "treatment"),
  model_type = "glm",
  method = "screen",
  p_threshold = 0.05,
  labels = clintrial_labels
)

example13
```

## **Example 14:** All Predictors

The `method = "all"` approach includes the same predictors in both univariable and multivariable analysis:

```{r}
example14 <- fullfit(
  data = clintrial,
  outcome = "any_complication",
  predictors = c("age", "sex", "treatment", "stage"),
  model_type = "glm",
  method = "all",
  labels = clintrial_labels
)

example14
```

## **Example 15:** Custom Selection

The `method = "custom"` approach allows explicit specification of multivariable predictors via the `multi_predictors` argument:

```{r}
example15 <- fullfit(
  data = clintrial,
  outcome = "icu_admission",
  predictors = c("age", "sex", "bmi", "smoking", "stage", "treatment"),
  model_type = "glm",
  method = "custom",
  multi_predictors = c("age", "sex", "stage", "treatment"),
  labels = clintrial_labels
)

example15
```

## **Example 16:** Controlling Output Columns

The `columns` parameter controls which results are displayed:

```{r}
# Univariable only
example16a <- fullfit(
  data = clintrial,
  outcome = "wound_infection",
  predictors = c("age", "sex", "stage"),
  model_type = "glm",
  columns = "uni"
)

example16a

# Multivariable only
example16b <- fullfit(
  data = clintrial,
  outcome = "wound_infection",
  predictors = c("age", "sex", "stage"),
  model_type = "glm",
  columns = "multi"
)

example16b
```

## **Example 17:** Survival Analysis

Output tables can use Cox regression for survival outcomes by specifying `Surv()` notation and `model_type = "coxph"`:

```{r}
example17 <- fullfit(
  data = clintrial,
  outcome = "Surv(os_months, os_status)",
  predictors = c("age", "sex", "treatment", "stage", "ecog"),
  model_type = "coxph",
  method = "screen",
  p_threshold = 0.10,
  labels = clintrial_labels
)

example17
```

---

# Extended Model Types

The `summata` package supports several additional model types for specialized analyses. These require external packages that are suggested dependencies.

## **Example 18:** Negative Binomial

Negative binomial models are appropriate for overdispersed count data where the variance exceeds the mean—a common violation of the Poisson assumption. This model type requires the `MASS` package. The `ae_count` variable in `clintrial` is generated with overdispersion, making it ideal for this demonstration.

```{r}
example18 <- fit(
  data = clintrial,
  outcome = "ae_count",
  predictors = c("age", "sex", "diabetes", "treatment"),
  model_type = "negbin",
  labels = clintrial_labels
)

example18
```

## **Example 19:** Gamma Regression

Gamma regression is appropriate for positive, continuous, right-skewed outcomes. In the `clintrial` dataset, length of hospital stay (`los_days`) fits this description. Note that the default link for Gamma is the *inverse* link, which produces coefficients on a difficult-to-interpret scale. The log link is generally preferred for interpretability—exponentiated coefficients then represent multiplicative effects on the mean.

```{r}
example19 <- fit(
  data = clintrial,
  outcome = "los_days",
  predictors = c("age", "ecog", "stage", "treatment"),
  model_type = "glm",
  family = Gamma(link = "log"),
  labels = clintrial_labels
)

example19
```

## **Example 20:** Random-Intercept Model

Linear mixed-effects models (LMMs) account for hierarchical or clustered data structures. The `clintrial` dataset includes a `site` variable representing the study site, making it suitable for demonstrating random effects. This model type requires the `lme4` package.

Include random intercepts for study site using `(1|site)` notation in the predictors:

```{r, eval = requireNamespace("lme4", quietly = TRUE)}
example20 <- fit(
  data = clintrial,
  outcome = "los_days",
  predictors = c("age", "sex", "treatment", "stage", "(1|site)"),
  model_type = "lmer",
  labels = clintrial_labels
)

example20
```

## **Example 21:** GLMM for Binary Outcome

Generalized linear mixed-effects models (GLMMs) extend mixed-effects models to non-normal outcomes (binary, count). This model type also requires the `lme4` package.

Model 30-day readmission with site-level random effects:

```{r, eval = requireNamespace("lme4", quietly = TRUE)}
example21 <- fit(
  data = clintrial,
  outcome = "readmission_30d",
  predictors = c("age", "sex", "diabetes", "treatment", "(1|site)"),
  model_type = "glmer",
  family = "binomial",
  labels = clintrial_labels
)

example21
```

## **Example 22:** GLMM for Count Outcome

For count outcomes with clustering, use `fu_count` (equidispersed) with site-level random effects. Standard Poisson GLMMs assume equidispersion:

```{r, eval = requireNamespace("lme4", quietly = TRUE)}
example22 <- fit(
  data = clintrial,
  outcome = "fu_count",
  predictors = c("age", "stage", "treatment", "(1|site)"),
  model_type = "glmer",
  family = "poisson",
  labels = clintrial_labels
)

example22
```

## **Example 23:** Cox Mixed-Effects Model

Cox mixed-effects models account for within-cluster correlation in survival outcomes. This is useful when patients are nested within sites or physicians and outcomes may be correlated within clusters. This model type requires the `coxme` package. Model overall survival with site-level random effects:
  
```{r, eval = requireNamespace("coxme", quietly = TRUE)}
example23 <- fit(
  data = clintrial,
  outcome = "Surv(os_months, os_status)",
  predictors = c("age", "sex", "treatment", "stage", "(1|site)"),
  model_type = "coxme",
  labels = clintrial_labels
)

example23
```

## **Example 24:** Quasibinomial for Overdispersed Binary Data

When binary or count data exhibit overdispersion (residual deviance >> residual degrees of freedom), quasi-likelihood models provide more appropriate standard errors.

```{r}
example24 <- fit(
  data = clintrial,
  outcome = "any_complication",
  predictors = c("age", "sex", "diabetes", "stage"),
  model_type = "glm",
  family = "quasibinomial",
  labels = clintrial_labels
)

example24
```

## **Example 25:** Conditional Logistic Model for Matched Data

Conditional logistic regression is used for matched case-control studies where cases and controls are matched within strata. The stratification variable should be specified using the `strata` parameter. This model type requires the `survival` package.

```{r}
# Matched case-control dataset (100 pairs)
set.seed(456)
n_pairs <- 100
matched_data <- do.call(rbind, lapply(1:n_pairs, function(i) {
  base_bmi <- rnorm(1, 27, 4)
  data.frame(
    match_id = i,
    case = c(1, 0),
    smoking = factor(c(rbinom(1, 1, 0.45), rbinom(1, 1, 0.30)), 
                     levels = c(0, 1), labels = c("No", "Yes")),
    diabetes = factor(c(rbinom(1, 1, 0.30), rbinom(1, 1, 0.20)), 
                      levels = c(0, 1), labels = c("No", "Yes")),
    bmi = c(base_bmi + rnorm(1, 1.5, 2), base_bmi + rnorm(1, -0.5, 2))
  )
}))

example25 <- fit(
  data = matched_data,
  outcome = "case",
  predictors = c("smoking", "diabetes", "bmi"),
  model_type = "clogit",
  strata = "match_id"
)

example25
```

The output shows odds ratios conditional on the matching strata.

---

# Exporting Results

Regression tables can be exported to various formats:

```{r, eval = FALSE}
# Microsoft Word
table2docx(
  table = example13,
  file = file.path(tempdir(), "Table2_Regression.docx"),
  caption = "Table 2. Univariable and Multivariable Analysis"
)

# PDF
table2pdf(
  table = example13,
  file = file.path(tempdir(), "Table2_Regression.pdf"),
  caption = "Table 2. Regression Results"
)
```

See the [Table Export](table_export.html) vignette for comprehensive documentation.

---

# Best Practices

## Variable Selection Strategy

1. **Exploratory phase**: Use `uniscreen()` with liberal *p*-threshold (e.g., 0.20)
2. **Model building**: Use `fullfit()` with `method = "screen"` or `method = "custom"`
3. **Confirmatory analysis**: Use `fit()` with pre-specified predictors
4. **Sensitivity analysis**: Compare specifications with `compfit()` (*see* [Model Comparison](model_comparison.html))

## Choosing the Right Model

| Outcome Type | Characteristics | Recommended Model |
|:-------------|:----------------|:------------------|
| Binary | Independent observations | `glm` with `binomial` |
| Binary | Overdispersed | `glm` with `quasibinomial` |
| Binary | Clustered/hierarchical | `glmer` with `binomial` |
| Binary | Matched case-control | `clogit` |
| Count | Mean ≈ variance | `glm` with `poisson` |
| Count | Variance > mean | `negbin` or `quasipoisson` |
| Count | Clustered | `glmer` with `poisson` |
| Continuous | Symmetric, normal errors | `lm` |
| Continuous | Positive, right-skewed | `glm` with `Gamma` |
| Continuous | Clustered | `lmer` |
| Time-to-event | Independent | `coxph` |
| Time-to-event | Clustered | `coxme` |

## Interpreting Results

The comparison between univariable and multivariable estimates reveals confounding:

- **Large differences**: Substantial confounding present
- **Similar estimates**: Association robust to adjustment
- **Sign reversal**: Simpson's paradox; investigate carefully

## Sample Size Considerations

Adequate events per predictor are required for stable estimation:

- Logistic/Cox models: ≥10 events per predictor (rule of thumb)
- Linear models: ≥10–20 observations per predictor
- Mixed-effects models: ≥5–10 clusters recommended
- Use `method = "screen"` to reduce predictors when sample size is limited

---

# Common Issues

## Missing Data

Regression functions use complete-case analysis by default:

```{r, eval = FALSE}
# Check missingness
sapply(clintrial[, c("age", "sex", "stage")], function(x) sum(is.na(x)))

# Create complete-case dataset explicitly
complete_data <- na.omit(clintrial[, c("readmission_30d", "age", "sex", "stage")])
```

## Factor Reference Levels

Ensure reference categories are set appropriately:

```{r, eval = FALSE}
# Set specific reference level
clintrial$stage <- relevel(factor(clintrial$stage), ref = "I")
```

## Convergence Issues

For models that fail to converge:

```{r, eval = FALSE}
# Access model for diagnostics
result <- fit(data, outcome, predictors, model_type = "glm")
model <- attr(result, "model")

# Check convergence
model$converged

# Large coefficients may indicate separation
coef(model)
```

## Overdispersion Diagnostics

To check for overdispersion in Poisson or binomial models:

```{r, eval = FALSE}
# Fit Poisson model
result <- fit(data, outcome, predictors, model_type = "glm", family = "poisson")
model <- attr(result, "model")

# Dispersion estimate (should be ~1 for no overdispersion)
sum(residuals(model, type = "pearson")^2) / model$df.residual
```

If the dispersion is substantially greater than 1, consider using `quasipoisson`, `quasibinomial`, or `negbin` instead.

## Mixed-Effects Model Convergence

Mixed-effects models can be sensitive to starting values and optimization:

```{r, eval = FALSE}
# Increase iterations
result <- fit(
  data = clintrial,
  outcome = "readmission_30d",
  predictors = c("age", "treatment", "(1|site)"),
  model_type = "glmer",
  family = "binomial",
  control = lme4::glmerControl(optimizer = "bobyqa", optCtrl = list(maxfun = 100000))
)
```

## Multinomial and Ordinal Outcomes

The `summata` package currently supports binary outcomes (via logistic regression) but not multinomial or ordinal outcomes with more than two categories. For categorical outcomes with more than two levels, use dedicated packages such as `nnet::multinom()` for multinomial regression or `MASS::polr()` for ordinal regression.

```{r, include = FALSE}
options(old_opts)
```

---

# Further Reading

- [Descriptive Tables](descriptive_tables.html): `desctable()` for baseline characteristics
- [Model Comparison](model_comparison.html): `compfit()` for comparing models
- [Table Export](table_export.html): Export to PDF, Word, and other formats
- [Forest Plots](forest_plots.html): Visualization of regression results
- [Multivariate Regression](multivariate_regression.html): `multifit()` for multi-outcome analysis
- [Advanced Workflows](advanced_workflows.html): Interactions and mixed-effects models