---
title: "Descriptive Tables"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Descriptive Tables}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  message = FALSE,
  warning = FALSE,
  dpi = 150
)

# Use ragg for better font rendering if available
if (requireNamespace("ragg", quietly = TRUE)) {
  knitr::opts_chunk$set(dev = "ragg_png")
}

old_opts <- options(width = 180)
```

Descriptive statistics provide the foundation for any quantitative analysis. Before estimating relationships between variables, it is essential to first characterize the distribution of each variable and assess balance across comparison groups. A well-constructed descriptive table—often designated "Table 1" in published research—accomplishes three objectives: it summarizes the central tendency and dispersion of continuous variables, tabulates the frequency distribution of categorical variables, and tests for systematic differences between groups.

The `desctable()` function generates publication-ready descriptive tables with automatic detection of variable types, appropriate summary statistics, and optional hypothesis testing. It adheres to the standard `summata` calling convention:

```{r, eval = FALSE}
desctable(data, by, variables, ...)
```

where `data` is the dataset, `by` specifies the grouping variable (optional), and `variables` lists the variables to summarize. This vignette demonstrates the function’s capabilities using the included sample dataset.

---

# Preliminaries

The examples in this vignette use the `clintrial` dataset included with `summata`:

```{r setup}
library(summata)

data(clintrial)
data(clintrial_labels)
```

The `clintrial` dataset contains `r nrow(clintrial)` observations with continuous, categorical, and time-to-event variables suitable for demonstrating descriptive statistics. The `clintrial_labels` vector provides human-readable labels for display.

---

# Summary Statistics and Tests

The `desctable()` function automatically selects appropriate summary statistics and hypothesis tests based on variable type:

| Variable Type | Summary Statistic | Two Groups | Three+ Groups |
|:--------------|:------------------|:-----------|:--------------|
| Continuous (parametric) | Mean ± SD | *t*-test | ANOVA |
| Continuous (nonparametric) | Median [IQR] | Wilcoxon | Kruskal–Wallis |
| Categorical | *n* (%) | χ² or Fisher | χ² or Fisher |
| Time-to-event | Median (95% CI) | Log-rank | Log-rank |

For categorical variables, the Fisher exact test is used when any expected cell count falls below 5. For continuous variables, test selection follows the displayed statistic: parametric tests accompany mean-based statistics, and nonparametric tests accompany median-based statistics.
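The expected-count rule can be illustrated in base R. The sketch below uses a hypothetical 2 × 2 table (not taken from `clintrial`) and only mirrors the rule as stated above; it is not a description of how `desctable()` is implemented internally.

```{r}
# Hypothetical 2 x 2 table of counts (illustrative values only)
tab <- matrix(
  c(3, 9, 7, 6),
  nrow = 2,
  dimnames = list(group = c("A", "B"), response = c("No", "Yes"))
)

# Expected counts under independence: row total * column total / grand total
expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)
expected

# Apply the rule: Fisher exact test if any expected count is below 5
use_fisher <- any(expected < 5)
result <- if (use_fisher) fisher.test(tab) else chisq.test(tab)

use_fisher
result$p.value
```

Here the smallest expected count is 10 × 12 / 25 = 4.8, so the rule selects the Fisher exact test even though every observed count except one is 5 or more.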

---

# Basic Usage

## **Example 1:** Grouped Descriptive Table

The most common use case for descriptive tables is comparing characteristics across groups:

```{r}
desc_vars <- c("age", "sex", "race", "bmi", "stage", "ecog", 
               "Surv(os_months, os_status)")

example1 <- desctable(
  data = clintrial, 
  by = "treatment", 
  variables = desc_vars,
  labels = clintrial_labels
)

example1
```

The output includes a "Total" column by default, showing overall statistics alongside group-specific values.

## **Example 2:** Ungrouped Summary Statistics

Omitting the `by` argument produces overall summary statistics without group comparisons:

```{r}
example2 <- desctable(
  data = clintrial,
  variables = c("age", "bmi", "sex", "stage"),
  labels = clintrial_labels
)

example2
```

---

# Customizing Summary Statistics

The default summary statistics can be customized for both continuous and categorical variables.

## **Example 3:** Continuous Variables

The `stats_continuous` parameter controls how continuous variables are summarized. It accepts one or more of the following values; the example in this section requests all three:

| Value | Output Format |
|:------|:--------------|
| `"mean_sd"` | Mean ± SD |
| `"median_iqr"` | Median [Q1–Q3] (default) |
| `"median_range"` | Median (min–max) |

```{r}
example3 <- desctable(
  data = clintrial,
  by = "treatment",
  variables = c("age", "bmi", "los_days"),
  stats_continuous = c("mean_sd", "median_iqr", "median_range"),
  labels = clintrial_labels
)

example3
```

## **Example 4:** Categorical Variables

The `stats_categorical` parameter controls categorical variable display:

| Value | Output Format |
|:------|:--------------|
| `"n_percent"` | *n* (%) (default) |
| `"n"` | *n* only |
| `"percent"` | % only |

```{r}
example4 <- desctable(
  data = clintrial,
  by = "treatment",
  variables = c("sex", "stage", "ecog"),
  stats_categorical = "percent",
  labels = clintrial_labels
)

example4
```

## **Example 5:** Numeric Precision

Control decimal places with `digits` (for statistics) and `p_digits` (for *p*-values):

```{r}
example5 <- desctable(
  data = clintrial,
  by = "treatment",
  variables = c("age", "bmi", "sex"),
  digits = 2,
  p_digits = 4,
  test = TRUE,
  labels = clintrial_labels
)

example5
```
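To make the effect of *p*-value precision concrete, the following base R sketch formats a few values at two precisions. The helper `fmt_p()` is hypothetical and written only for this illustration; it assumes that values below 10^(-`p_digits`) are shown with a "<" prefix, which may not match the exact formatting rules `desctable()` applies.

```{r}
# Hypothetical helper (not part of summata): format p-values to a fixed
# number of decimals, showing very small values as "< threshold"
fmt_p <- function(p, p_digits = 3) {
  threshold <- 10^(-p_digits)
  ifelse(
    p < threshold,
    paste0("< ", formatC(threshold, format = "f", digits = p_digits)),
    formatC(p, format = "f", digits = p_digits)
  )
}

p_values <- c(0.04567, 0.00004)

fmt_p(p_values, p_digits = 3)
fmt_p(p_values, p_digits = 4)
```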

---

# Statistical Testing

When comparing groups, hypothesis tests assess whether observed differences are statistically significant.

## **Example 6:** Disabling Automatic Test Selection

By default (`test = TRUE`), hypothesis tests are selected automatically based on the displayed summary statistic, and their *p*-values are reported. Setting `test = FALSE` omits testing altogether:

```{r}
example6 <- desctable(
  data = clintrial,
  by = "treatment",
  variables = c("age", "bmi", "sex", "stage"),
  test = FALSE,
  labels = clintrial_labels
)

example6
```

## **Example 7:** Specifying Tests Manually

Override automatic selection with `test_continuous` and `test_categorical`. Available test specifications include:

**Continuous** (`test_continuous`):

- `"auto"`: Automatic selection (default)
- `"t"`: Student *t*-test
- `"wrs"`: Wilcoxon rank-sum test
- `"aov"`: One-way ANOVA
- `"kwt"`: Kruskal–Wallis test

**Categorical** (`test_categorical`):

- `"auto"`: Automatic selection (default)
- `"chisq"`: Pearson χ² test
- `"fisher"`: Fisher exact test

The following example forces a parametric test (one-way ANOVA) for continuous variables:

```{r}
example7a <- desctable(
  data = clintrial,
  by = "treatment",
  variables = c("age", "bmi", "los_days"),
  test = TRUE,
  test_continuous = "aov",  # ANOVA
  labels = clintrial_labels
)

example7a
```

This example forces the Fisher exact test for categorical variables:

```{r}
example7b <- desctable(
  data = clintrial,
  by = "treatment",
  variables = c("sex", "stage"),
  test = TRUE,
  test_categorical = "fisher",
  labels = clintrial_labels
)

example7b
```
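For orientation, the tests named in this section have familiar base R counterparts in the `stats` package. The sketch below runs them on simulated data; the data frame `toy` and its columns are hypothetical, and the correspondence between the `test_continuous` codes and these functions is an assumption made for illustration rather than a statement about `summata` internals.

```{r}
set.seed(42)

# Simulated data: three arms of 20 observations each (illustrative only)
toy <- data.frame(
  arm = factor(rep(c("A", "B", "C"), each = 20)),
  value = rnorm(60, mean = rep(c(50, 52, 55), each = 20), sd = 8)
)

# Two-group subset for the two-sample tests
two_arms <- droplevels(subset(toy, arm != "C"))

# Two groups: Student t-test (equal variances) and Wilcoxon rank-sum test
p_t <- t.test(value ~ arm, data = two_arms, var.equal = TRUE)$p.value
p_wrs <- wilcox.test(value ~ arm, data = two_arms)$p.value

# Three groups: one-way ANOVA and Kruskal-Wallis test
p_aov <- summary(aov(value ~ arm, data = toy))[[1]][["Pr(>F)"]][1]
p_kwt <- kruskal.test(value ~ arm, data = toy)$p.value

round(c(t = p_t, wrs = p_wrs, aov = p_aov, kwt = p_kwt), 4)
```

With exactly two groups, one-way ANOVA and the equal-variance *t*-test give the same *p*-value, so forcing ANOVA mainly matters when the grouping variable has three or more levels.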

---

# Handling Missing Data

Missing values require special consideration in descriptive tables. Options control whether missing values are displayed and how percentages are calculated.

## **Example 8:** Including Missing Values

By default, missing values are excluded from calculations. Set `na_include = TRUE` to display them as a separate category:

```{r}
example8 <- desctable(
  data = clintrial,
  by = "treatment",
  variables = c("smoking", "diabetes"),
  na_include = TRUE,
  labels = clintrial_labels
)

example8
```

## **Example 9:** Missing Value Denominators

The `na_percent` parameter controls whether missing values are included in percentage calculations:

```{r}
# Percentages exclude missing (denominator = non-missing)
example9a <- desctable(
  data = clintrial,
  by = "treatment",
  variables = c("smoking"),
  na_include = TRUE,
  na_percent = FALSE,
  labels = clintrial_labels
)

example9a

# Percentages include missing (denominator = total)
example9b <- desctable(
  data = clintrial,
  by = "treatment",
  variables = c("smoking"),
  na_include = TRUE,
  na_percent = TRUE,
  labels = clintrial_labels
)

example9b
```
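The arithmetic behind the two denominators can be checked by hand. This base R sketch uses a small hypothetical vector (`smoking_toy`, not the `clintrial` variable) with 6 observed values and 2 missing values, and computes each level's percentage both ways.

```{r}
# Hypothetical data: 8 participants, 2 with missing smoking status
smoking_toy <- c("Never", "Never", "Former", "Current", NA, "Never", NA, "Former")

n_total <- length(smoking_toy)            # all participants
n_nonmissing <- sum(!is.na(smoking_toy))  # participants with an observed value

# table() drops NA by default, so these are counts of observed levels only
counts <- table(smoking_toy)

denominators <- data.frame(
  level = names(counts),
  n = as.vector(counts),
  pct_of_nonmissing = round(100 * as.vector(counts) / n_nonmissing, 1),
  pct_of_total = round(100 * as.vector(counts) / n_total, 1)
)

denominators
```

The same count of 3 for "Never" corresponds to 50% of non-missing values but 37.5% of all participants, which is why the choice of denominator should be stated in the table footnote or methods.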

## **Example 10:** Custom Missing Value Label

The label for missing values can be customized using the `na_label` parameter:

```{r}
example10 <- desctable(
  data = clintrial,
  by = "treatment",
  variables = c("smoking"),
  na_include = TRUE,
  na_label = "Not Reported",
  labels = clintrial_labels
)

example10
```

---

# Total Column Options

The total column provides overall statistics alongside group-specific values.

## **Example 11:** Total Column Configuration

The `total` parameter controls the presence and position of the total column:

| Value | Effect |
|:------|:-------|
| `TRUE`, `"first"` | Total column first (default) |
| `"last"` | Total column last |
| `FALSE` | No total column |

```{r}
# Total column in last position
example11a <- desctable(
  data = clintrial,
  by = "treatment",
  variables = c("age", "sex", "stage"),
  total = "last",
  labels = clintrial_labels
)

example11a

# No total column
example11b <- desctable(
  data = clintrial,
  by = "treatment",
  variables = c("age", "sex", "stage"),
  total = FALSE,
  labels = clintrial_labels
)

example11b
```

---

# Complete Example

The following example produces a comprehensive descriptive table suitable for publication:

```{r}
table1 <- desctable(
  data = clintrial,
  by = "treatment",
  variables = c(
    "age", "sex", "race", "ethnicity", "bmi",
    "smoking", "diabetes", "hypertension",
    "stage", "grade", "ecog",
    "Surv(os_months, os_status)"
  ),
  labels = clintrial_labels,
  stats_continuous = "mean_sd",
  stats_categorical = "n_percent",
  test = TRUE,
  total = TRUE,
  digits = 1,
  p_digits = 3
)

table1
```

## Accessing Raw Data

The underlying numeric values are stored as an attribute for programmatic access:

```{r}
raw_data <- attr(table1, "raw_data")
head(raw_data)
```

---

# Survival Summary Tables

For detailed survival analysis—including landmark survival estimates, survival quantiles, and multiple endpoints—see the dedicated [Survival Tables](survival_tables.html) vignette. The `survtable()` function provides comprehensive options for reporting time-to-event outcomes.

---

# Exporting Tables

Descriptive tables can be exported to various formats. See the [Table Export](table_export.html) vignette for comprehensive documentation.

```{r, eval = FALSE}
# Microsoft Word
table2docx(
  table = table1,
  file = file.path(tempdir(), "Table1.docx"),
  caption = "Table 1. Baseline Characteristics by Group"
)

# PDF (requires LaTeX)
table2pdf(
  table = table1,
  file = file.path(tempdir(), "Table1.pdf"),
  caption = "Table 1. Baseline Characteristics by Group"
)

# HTML
table2html(
  table = table1,
  file = file.path(tempdir(), "Table1.html"),
  caption = "Table 1. Baseline Characteristics by Group"
)
```

---

# Best Practices

## Variable Selection

1. Include all relevant baseline characteristics
2. Order variables logically (typically chronologically and by domain)
3. Exclude the grouping variable from the variables list

## Statistical Considerations

1. Use automatic test selection unless there is specific justification otherwise
2. Report exact *p*-values where possible; values below the display threshold appear as "< 0.001" (or the equivalent at the chosen precision)
3. Consider multiple comparison adjustments when testing many variables
4. For skewed or otherwise non-normal continuous variables, report nonparametric summaries (e.g., median with IQR) rather than parametric ones (e.g., mean with SD)
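
As a minimal sketch of the multiple comparison point, base R's `p.adjust()` can be applied to a vector of *p*-values collected from a table. The values and variable names below are hypothetical and are not taken from any `desctable()` output.

```{r}
# Hypothetical unadjusted p-values from five baseline comparisons
p_raw <- c(age = 0.001, bmi = 0.012, sex = 0.030, stage = 0.045, ecog = 0.200)

# Holm controls the family-wise error rate;
# Benjamini-Hochberg ("BH") controls the false discovery rate
p_holm <- p.adjust(p_raw, method = "holm")
p_bh <- p.adjust(p_raw, method = "BH")

round(cbind(raw = p_raw, holm = p_holm, BH = p_bh), 4)
```

Which adjustment is appropriate, if any, depends on the purpose of the table; baseline tables in randomized trials are often reported without adjustment.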

## Formatting Recommendations

1. Use consistent decimal precision within variable types
2. Include units in variable labels (e.g., "Age (years)")
3. Include a total column for context
4. Use landscape orientation for tables with many columns

---

# Common Issues

## Empty Cells in Categorical Variables

When a factor level has zero observations in a group, ensure all levels are explicitly defined:

```{r, eval = FALSE}
data$stage <- factor(data$stage, levels = c("I", "II", "III", "IV"))
```
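
The effect of declaring levels explicitly is easy to see with base R's `table()`. The vector below is a hypothetical example in which stage IV never occurs.

```{r}
# Hypothetical stage values with no stage IV observations
stage_chr <- c("I", "II", "II", "III")

# Without explicit levels, the unobserved category is dropped
table(stage_chr)

# With explicit levels, stage IV is retained with a count of zero
stage_fct <- factor(stage_chr, levels = c("I", "II", "III", "IV"))
table(stage_fct)
```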

## Skewed Continuous Variables

For highly skewed distributions, use median and IQR:

```{r, eval = FALSE}
desctable(data, by, variables, stats_continuous = "median_iqr")
```
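
A small base R sketch shows why the median is preferred here. The vector `los_toy` is hypothetical: six short hospital stays and one very long one.

```{r}
# Hypothetical length-of-stay values (days) with a single extreme observation
los_toy <- c(1, 2, 2, 3, 3, 4, 30)

# Mean and SD are pulled upward by the outlier
los_mean <- mean(los_toy)
los_sd <- sd(los_toy)

# Median and quartiles describe the typical stay
los_median <- median(los_toy)
los_iqr <- quantile(los_toy, probs = c(0.25, 0.75))

round(c(mean = los_mean, sd = los_sd), 2)
los_median
los_iqr
```

The mean exceeds six of the seven observations, whereas the median of 3 days and the quartiles of 2 and 3.5 days reflect where most values lie.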

## Large Tables

For tables with many variables, consider splitting by category or using landscape orientation for export:

```{r, eval = FALSE}
table2pdf(table, file.path(tempdir(), "table1.pdf"), orientation = "landscape", font_size = 8)
```

```{r, include = FALSE}
options(old_opts)
```

---

# Further Reading

- [Survival Tables](survival_tables.html): `survtable()` for time-to-event summaries
- [Regression Modeling](regression_modeling.html): `fit()`, `uniscreen()`, and `fullfit()`
- [Model Comparison](model_comparison.html): `compfit()` for comparing models
- [Table Export](table_export.html): Export to PDF, Word, and other formats
- [Forest Plots](forest_plots.html): Visualization of regression results
- [Multivariate Regression](multivariate_regression.html): `multifit()` for multi-outcome analysis
- [Advanced Workflows](advanced_workflows.html): Interactions and mixed-effects models