---
title: "Automatic Variable Labeling"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Automatic Variable Labeling}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

# Set gtsummary print engine for proper rendering
options(gtsummary.print_engine = "gt")
```

```{r setup}
#| eval: false
library(sumExtras)
library(gtsummary)
library(dplyr)
library(gt)
library(ggplot2) # Needed for the plotting examples later in this vignette

# Apply the recommended JAMA theme
use_jama_theme()
```

```{r setup2}
#| echo: false
#| message: false
#| warning: false
library(sumExtras)
library(gtsummary)
library(dplyr)
library(gt)
library(ggplot2)

# Apply the recommended JAMA theme
use_jama_theme()
```

## Overview

One of the most time-consuming aspects of creating publication-ready tables and plots is labeling variables with human-readable descriptions. Instead of manually typing labels for every variable in every table and plot, sumExtras provides a unified labeling system that works across gtsummary and ggplot2.

This vignette covers:

1. How R's label attribute system works and why it matters
2. Creating and maintaining data dictionaries
3. Labeling gtsummary tables with `add_auto_labels()`
4. Setting label attributes with `apply_labels_from_dictionary()`
5. Controlling label priority when multiple sources exist
6. Cross-package workflows with gtsummary and ggplot2
7. Real-world analysis examples

## How It Works: The R Label Convention

sumExtras uses R's built-in `attr()` function to work with variable labels - the same labeling approach used by haven, Hmisc, labelled, and ggplot2 4.0+. This means labels work seamlessly across the R ecosystem, whether you're creating tables with gtsummary, plots with ggplot2, or outputs with gt.

### Understanding Label Attributes

Labels in R are stored as attributes on individual variables.
Here's what happens behind the scenes:

```{r}
# Create a simple dataset
trial_example <- trial

# Set a label attribute on a variable
attr(trial_example$age, "label") <- "Age at Enrollment (years)"

# Check the label
attr(trial_example$age, "label")
```

Once set, this label attribute is recognized by:

- **gtsummary** - Used in table headers and variable labels
- **ggplot2 4.0+** - Automatically used for axis and legend labels
- **gt** - Honored in table outputs
- **Hmisc** - Compatible with its labeling functions
- **haven** - Preserved when reading/writing SAS, SPSS, Stata files

### Where Labels Come From

Your data may already have labels from various sources:

- **Statistical software imports** - `haven::read_sas()`, `haven::read_spss()`, `haven::read_stata()`
- **R packages** - Hmisc's `label()`, labelled package functions
- **Manual assignment** - Setting attributes directly
- **Collaborative projects** - Labels from other team members
- **sumExtras** - `apply_labels_from_dictionary()`

The key is that if labels are there, sumExtras can use them. This flexibility means one labeling system works everywhere - no matter where your data came from or how it was prepared.

## Creating a Data Dictionary

A data dictionary serves dual purposes: it documents your variables and provides labels for automatic application. The dictionary is simply a data frame with two required columns:

- **`Variable`**: The exact variable names from your dataset
- **`Description`**: Human-readable labels you want to display

```{r}
# Create a dictionary for the trial dataset
dictionary <- tibble::tribble(
  ~Variable, ~Description,
  "trt", "Chemotherapy Treatment",
  "age", "Age at Enrollment (years)",
  "marker", "Marker Level (ng/mL)",
  "stage", "T Stage",
  "grade", "Tumor Grade",
  "response", "Tumor Response",
  "death", "Patient Died"
)

dictionary
```

### Best Practices for Dictionaries

In real projects, you would typically:

1. **Store externally** - Keep the dictionary as a CSV file or database table
2. **Load once** - Read it at the beginning of your analysis script
3. **Version control** - Track changes to labels over time
4. **Share widely** - Use the same dictionary across all project analyses

Example of loading from a CSV:

```{r, eval=FALSE}
# Typically at the top of your analysis script
dictionary <- readr::read_csv("data/variable_dictionary.csv")
```

This centralizes your variable documentation and ensures consistency across all outputs.

## Labeling gtsummary Tables with `add_auto_labels()`

The `add_auto_labels()` function is designed to be flexible and intelligent. It can work with dictionaries, pre-labeled data, or both, and it always respects manual overrides.

### Method 1: Pass Dictionary Explicitly

The most straightforward approach is to pass your dictionary directly to the function:

```{r}
trial |>
  tbl_summary(by = trt, include = c(age, grade, marker)) |>
  add_auto_labels(dictionary = dictionary) |>
  extras()
```

This approach is explicit and clear - you can see exactly where the labels are coming from.

### Method 2: Automatic Discovery

If you have a `dictionary` object in your environment, `add_auto_labels()` will find it automatically without needing to pass it explicitly:

```{r}
# Dictionary is already in environment from above
trial |>
  tbl_summary(by = trt, include = c(age, stage, response)) |>
  add_auto_labels() |> # Finds dictionary automatically
  extras()
```

The first time `add_auto_labels()` finds your dictionary automatically in a session, you'll see a friendly message: "Auto-labeling from 'dictionary' object in your environment (this message will only show once per session)". This confirms that your dictionary was found and is being used.

This is particularly convenient when working in an R Markdown or Quarto document where your dictionary is defined once at the top.
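
To make the idea of auto-discovery more concrete, here is a minimal base-R sketch of how a lookup like this could work. It is an illustration only, not the sumExtras implementation: `find_dictionary_sketch()` and `demo_env` are hypothetical names, and the package may search for the dictionary differently.

```{r}
# Hypothetical helper for illustration; NOT the sumExtras implementation.
# Looks for an object named "dictionary" starting in `env` and walking up
# through its enclosing environments, then checks the two required columns.
find_dictionary_sketch <- function(env = parent.frame()) {
  if (!exists("dictionary", envir = env, inherits = TRUE)) {
    return(NULL)
  }
  candidate <- get("dictionary", envir = env, inherits = TRUE)
  required <- c("Variable", "Description")
  if (is.data.frame(candidate) && all(required %in% names(candidate))) {
    candidate
  } else {
    NULL
  }
}

# Demonstrate with a throwaway environment so nothing else is overwritten
demo_env <- new.env()
demo_env$dictionary <- data.frame(Variable = "age", Description = "Age (years)")
find_dictionary_sketch(demo_env)
```

A lookup by name is convenient but implicit, which is why passing `dictionary =` explicitly (Method 1) remains the clearer choice in shared code.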

### Method 3: Working with Pre-Labeled Data

If your data already has label attributes (from packages like haven, labelled, or set manually), `add_auto_labels()` can read them directly:

```{r}
# Create data with label attributes
labeled_trial <- trial
attr(labeled_trial$age, "label") <- "Patient Age at Baseline"
attr(labeled_trial$marker, "label") <- "Biomarker Concentration (ng/mL)"

# Use attributes for labeling (no dictionary needed)
labeled_trial |>
  tbl_summary(by = trt, include = c(age, marker)) |>
  add_auto_labels() # Reads from label attributes
```

This is especially useful when working with data imported from SAS, SPSS, or Stata files that already contain variable labels.

### Manual Overrides Always Win

No matter where labels come from (dictionary or attributes), manual labels specified in your `tbl_summary()` call always take precedence:

```{r}
trial |>
  tbl_summary(
    by = trt,
    include = c(age, grade, marker),
    label = list(age ~ "Age (Custom Label)") # This overrides dictionary/attributes
  ) |>
  add_auto_labels(dictionary = dictionary) |>
  extras()
```

This gives you complete control: use automated labeling for most variables, but override specific ones when needed.

### Working with Regression Tables

The labeling system works seamlessly with regression tables too:

```{r}
lm(marker ~ age + grade + stage, data = trial) |>
  tbl_regression() |>
  add_auto_labels(dictionary = dictionary)
```

Labels are applied to both the predictors and the outcome variable, making regression output immediately readable.

## Setting Label Attributes with `apply_labels_from_dictionary()`

While `add_auto_labels()` works directly on gtsummary tables, `apply_labels_from_dictionary()` takes a different approach: it sets label attributes on your data frame. This enables cross-package workflows where the same labels work in both gtsummary tables and ggplot2 visualizations.
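
Conceptually, setting labels from a dictionary amounts to matching variable names and writing a `"label"` attribute on each matched column. The base-R sketch below illustrates that idea under stated assumptions: `set_labels_sketch()` and the `demo_*` objects are hypothetical, and the package's own function may do more (for example, input validation).

```{r}
# Hypothetical base-R sketch; NOT the sumExtras implementation.
set_labels_sketch <- function(data, dictionary) {
  # Only label variables that appear in both the data and the dictionary
  matched <- intersect(names(data), dictionary$Variable)
  for (var in matched) {
    # Use the first dictionary row whose Variable matches this column
    row <- match(var, dictionary$Variable)
    attr(data[[var]], "label") <- as.character(dictionary$Description[row])
  }
  data
}

# Small made-up data set and dictionary, for illustration only
demo_data <- data.frame(age = c(34, 51), marker = c(0.8, 1.2), id = 1:2)
demo_dict <- data.frame(
  Variable = c("age", "marker", "not_in_data"),
  Description = c("Age at Enrollment (years)", "Marker Level (ng/mL)", "Unused")
)

demo_labeled <- set_labels_sketch(demo_data, demo_dict)
attr(demo_labeled$age, "label")
attr(demo_labeled$id, "label") # NULL: `id` has no dictionary entry
```

Columns without a dictionary entry are left untouched, and dictionary rows that do not match any column are ignored.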

### Basic Usage

```{r}
# Apply labels to data as attributes
trial_labeled <- trial |>
  apply_labels_from_dictionary(dictionary = dictionary)

# Check that labels were set
attr(trial_labeled$age, "label")
attr(trial_labeled$marker, "label")
```

Now this labeled data can be used anywhere R label attributes are recognized.

### Use Labeled Data in gtsummary

```{r}
# Labels are automatically recognized
trial_labeled |>
  tbl_summary(by = trt, include = c(age, marker, grade)) |>
  add_auto_labels() |> # Reads attributes automatically
  extras()
```

Notice we don't need to pass the dictionary - the labels are already stored as attributes on the data.

### Use Labeled Data in ggplot2

With ggplot2 version 4.0 and later, label attributes are automatically used for axis and legend labels:

```{r, fig.width=7, fig.height=4}
#| warning: false
# Labels appear automatically on axes and legend!
trial_labeled |>
  ggplot(aes(x = age, y = marker, color = trt)) +
  geom_point(alpha = 0.6) +
  theme_minimal()
```

No need to manually specify `labs()` - the labels from your dictionary are applied automatically to the x-axis, y-axis, and legend.

## Controlling Label Priority

When your data has both dictionary labels and attribute labels available, `add_auto_labels()` needs to decide which one to use. You control this with a global option.

### Default Behavior: Attributes Have Priority

By default, label attributes take precedence over dictionary labels.
This respects labels that may have been carefully set by data import functions (like `haven::read_sas()`) or other preprocessing steps:

```{r}
# Create data with both sources of labels
trial_both <- trial
attr(trial_both$age, "label") <- "Age from Attribute"

# Also define a dictionary whose label conflicts with the attribute
dictionary_conflict <- tibble::tribble(
  ~Variable, ~Description,
  "age", "Age from Dictionary"
)

# Default: attribute wins
trial_both |>
  tbl_summary(by = trt, include = age) |>
  add_auto_labels(dictionary = dictionary_conflict) |>
  extras() # Shows: "Age from Attribute"
```

### Prefer Dictionary: When to Use `TRUE`

If you want dictionary labels to override attribute labels, set the `sumExtras.preferDictionary` option to `TRUE`. This is useful when you're actively maintaining a master dictionary and want it to be the single source of truth:

```{r}
# Prioritize dictionary over attributes
options(sumExtras.preferDictionary = TRUE)

trial_both |>
  tbl_summary(by = trt, include = age) |>
  add_auto_labels(dictionary = dictionary_conflict) |>
  extras() # Shows: "Age from Dictionary"

# Reset to default for rest of vignette
options(sumExtras.preferDictionary = FALSE)
```

### When to Use Each Setting

- **`FALSE` (default)**: You're importing labeled data from SAS/Stata/SPSS and want to preserve those labels
- **`TRUE`**: You maintain a master dictionary and want it to override any existing labels

Remember: manual labels set via `label = list(...)` in `tbl_summary()` always win, regardless of this option.

## Cross-Package Workflows: Tables and Plots

Often you need consistent labels across both gtsummary tables and ggplot2 visualizations. The combination of `apply_labels_from_dictionary()` and `add_auto_labels()` enables this seamlessly.

### Complete Workflow Example

Here's a realistic workflow showing how one dictionary serves both gtsummary tables and ggplot2 visualizations:

```{r, fig.width=7, fig.height=5}
#| warning: false
# 1. Define dictionary once
my_dictionary <- tibble::tribble(
  ~Variable, ~Description,
  "age", "Age at Enrollment (years)",
  "marker", "Marker Level (ng/mL)",
  "trt", "Treatment Group",
  "grade", "Tumor Grade",
  "stage", "T Stage"
)

# 2. Apply to data
trial_final <- trial |>
  apply_labels_from_dictionary(my_dictionary)

# 3. Create gtsummary table
trial_final |>
  tbl_summary(
    by = trt,
    include = c(age, marker, grade, stage)
  ) |>
  add_auto_labels() |>
  extras()

# 4. Create ggplot2 visualization with same labels
trial_final |>
  filter(!is.na(marker)) |>
  ggplot(aes(x = age, y = marker)) +
  geom_point(aes(color = grade), alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE, color = "black") +
  facet_wrap(~trt) +
  theme_minimal() +
  theme(legend.position = "bottom")
```

Notice how the axis labels and legend title are automatically pulled from your dictionary - no manual `labs()` calls needed! (The facet strips show the values of `trt`, not its variable label.) This workflow keeps your tables and plots consistent.

### Benefits of This Approach

1. **One source of truth** - Labels defined once in the dictionary
2. **Consistency** - Same labels in tables and plots automatically
3. **Maintainability** - Update labels in one place
4. **Efficiency** - No repetitive `labs()` or `label = list()` calls
5. **Documentation** - Dictionary serves as project documentation

## Real-World Example: Complete Analysis

Here's a comprehensive example showing how the labeling system streamlines a typical analysis workflow:

```{r, fig.width=8, fig.height=6}
#| warning: false
# Step 1: Define your master dictionary
# In practice, this would be loaded from a CSV file
study_dictionary <- tibble::tribble(
  ~Variable, ~Description,
  "trt", "Treatment Assignment",
  "age", "Age at Baseline (years)",
  "marker", "Biomarker Level (ng/mL)",
  "stage", "Clinical Stage",
  "grade", "Tumor Grade",
  "response", "Treatment Response",
  "death", "Patient Died"
)

# Step 2: Apply labels to your data once
trial_study <- trial |>
  apply_labels_from_dictionary(study_dictionary)

# Step 3: Create multiple tables using the same labels

# Table 1: Overall summary
trial_study |>
  tbl_summary(include = c(age, marker, stage, grade)) |>
  add_auto_labels() |>
  extras(overall = TRUE, pval = FALSE)

# Table 2: By treatment comparison
trial_study |>
  tbl_summary(
    by = trt,
    include = c(age, marker, response)
  ) |>
  add_auto_labels() |>
  extras()

# Table 3: Regression analysis
lm(marker ~ age + grade + stage, data = trial_study) |>
  tbl_regression() |>
  add_auto_labels()

# Step 4: Create plots using the same labels

# Plot 1: Age distribution by treatment
trial_study |>
  ggplot(aes(x = trt, y = age, fill = trt)) +
  geom_boxplot(alpha = 0.7) +
  theme_minimal() +
  theme(legend.position = "none")

# Plot 2: Marker vs age relationship
trial_study |>
  filter(!is.na(marker)) |>
  ggplot(aes(x = age, y = marker, color = trt)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "loess", se = FALSE) +
  theme_minimal()

# Plot 3: Response rates by grade and treatment
trial_study |>
  filter(!is.na(response)) |>
  count(grade, trt, response) |>
  group_by(grade, trt) |>
  mutate(prop = n / sum(n)) |>
  filter(response == 1) |>
  ggplot(aes(x = grade, y = prop, fill = trt)) +
  geom_col(position = "dodge") +
  scale_y_continuous(labels = scales::percent) +
  labs(y = "Response Rate") +
  theme_minimal()
```

This workflow demonstrates the power of the labeling system: define your labels once in a dictionary, apply them to your data, then create as many tables and plots as you need with consistent, professional labeling throughout.

## Advanced Patterns

### Working with Subsets

When you create subsets of labeled data, labels are preserved:

```{r}
# Create a subset
trial_subset <- trial_labeled |>
  filter(stage %in% c("T1", "T2")) |>
  select(age, marker, stage, trt)

# Labels are still there
trial_subset |>
  tbl_summary(by = trt) |>
  add_auto_labels() |>
  extras()
```

### Combining with dplyr Operations

Labels survive most dplyr operations:

```{r}
# Labels persist through mutations
trial_labeled |>
  mutate(
    age_group = cut(age,
      breaks = c(0, 50, 70, 100),
      labels = c("<50", "50-70", ">70")
    )
  ) |>
  select(age, age_group, marker, trt) |>
  tbl_summary(by = trt, include = c(age, marker)) |>
  add_auto_labels() |>
  extras()
```

Note: New variables created with `mutate()` won't have labels unless you set them explicitly or add them to your dictionary.

### Working with Multiple Dictionaries

For large projects, you might maintain separate dictionaries for different data domains:

```{r}
# Demographics dictionary
demographics_dict <- tibble::tribble(
  ~Variable, ~Description,
  "age", "Age at Enrollment (years)",
  "sex", "Biological Sex"
)

# Clinical dictionary
clinical_dict <- tibble::tribble(
  ~Variable, ~Description,
  "marker", "Marker Level (ng/mL)",
  "stage", "T Stage",
  "grade", "Tumor Grade"
)

# Combine for use
combined_dict <- bind_rows(demographics_dict, clinical_dict)

trial |>
  tbl_summary(include = c(age, marker, grade)) |>
  add_auto_labels(dictionary = combined_dict) |>
  extras()
```

## Troubleshooting

### Labels Not Appearing

If labels aren't showing up, check:

1. **Variable names match exactly** - Dictionary Variable column must match data exactly (case-sensitive)
2. **Dictionary in scope** - If using auto-discovery, ensure dictionary object exists
3. **Manual labels present** - Manual labels always override automatic ones
4. **Attribute structure** - Use `str(your_data)` to verify label attributes exist

```{r}
# Check for label attributes
str(trial_labeled$age)
```

### Dictionary Not Found

If you get "dictionary not found" messages:

1. **Name the object 'dictionary'** - Auto-discovery looks for an object named exactly "dictionary"
2. **Pass explicitly** - Use `add_auto_labels(dictionary = my_dict)` if named differently
3. **Check environment** - Ensure dictionary is loaded in the current session

### Conflicting Labels

When you have multiple label sources:

1. **Understand priority**: attributes > dictionary (by default)
2. **Use preferDictionary option**: Set `options(sumExtras.preferDictionary = TRUE)` to reverse
3. **Manual override**: Use `label = list(var ~ "Custom")` in `tbl_summary()` for specific variables

## Summary

The sumExtras labeling system provides a unified approach to variable labeling across your entire analysis:

- **`add_auto_labels()`** - Smart labeling for gtsummary tables (uses dictionary or attributes)
- **`apply_labels_from_dictionary()`** - Set labels as data attributes for cross-package workflows
- **One dictionary** - Consistent labels across tables, plots, and outputs
- **Flexible priority** - Control whether attributes or dictionary takes precedence
- **Manual overrides** - Always respected when you need custom labels

For more information:

- `vignette("sumExtras-intro")` - Getting started with sumExtras
- `vignette("styling")` - Advanced table styling and formatting
- `?add_auto_labels` - Function documentation
- `?apply_labels_from_dictionary` - Function documentation

The labeling system is designed to save you time while ensuring consistency. Define your labels once, use them everywhere, and let sumExtras handle the rest.