---
title: "Variable schema reference"
format: html
vignette: >
  %\VignetteIndexEntry{Variable schema reference}
  %\VignetteEngine{quarto::html}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r, echo=FALSE, message=FALSE, warning=FALSE}
library(chmsflow)
library(DT)
library(knitr)
library(kableExtra)
```

## Overview

chmsflow uses two CSV metadata files to define how raw CHMS variables are harmonized. These files are bundled with the package in `inst/extdata/` and are also available as data objects (`variables` and `variable_details`).

- **`variables.csv`** -- lists every harmonized variable with its name, label, type, and unit
- **`variable-details.csv`** -- defines the row-by-row recoding rules that `rec_with_table()` applies

This vignette is a column-by-column reference for both files. For an explanation of how these files fit into the harmonization workflow, see [Methodology](methodology.html).

## `variables.csv`

```{r, echo=FALSE}
cat(
  "There are", nrow(variables), "variables, grouped in",
  sum(!duplicated(variables$subject)), "subjects and",
  sum(!duplicated(variables$section)), "sections.\n"
)
```

```{r echo=FALSE, results='asis', warning=FALSE}
datatable(variables, filter = "top", options = list(pageLength = 5))
```

### Columns

**1. `variable`** -- the name of the harmonized variable.

**2. `label`** -- a short label for the variable.

**3. `labelLong`** -- a more detailed label for the variable.

**4. `section`** -- the broad grouping where this variable belongs (e.g., sociodemographics, health behaviour, health status).

**5. `subject`** -- the specific topic the variable pertains to (e.g., age, smoking, blood pressure).

**6. `variableType`** -- whether the harmonized variable is `Categorical` or `Continuous`.

**7. `units`** -- the units of the harmonized variable, or `N/A` if unitless.

**8. `databaseStart`** -- the CHMS cycles that contain the variable, separated by commas.

**9. `variableStart`** -- the source variable names as listed in each CHMS cycle. Uses the same format conventions as `variable-details.csv` (see below).

## `variable-details.csv`

```{r, echo=FALSE}
cat(
  "There are", nrow(variable_details), "rows and",
  ncol(variable_details), "columns.\n"
)
datatable(variable_details, options = list(pageLength = 5))
```

### Row structure

Each row defines the recoding rule for one category of one variable. For a categorical variable with 4 categories, plus a not-applicable category, a missing category, and an else row, there are 7 rows.

Missing data rows use `haven::tagged_na()`:

- `NA::a` -- valid skip (not applicable)
- `NA::b` -- missing (don't know, refusal, not stated)

The `else` row catches values not matched by any other row.

### Columns

We use `clc_sex` as a running example.

**1. `variable`** -- name of the harmonized variable.

```{r, echo=FALSE, warning=FALSE}
kable(variable_details[variable_details$variable == "clc_sex", 1], col.names = "variable")
```

**2. `dummyVariable`** -- dummy variable name for each category (categorical variables only; `N/A` for continuous).

```{r, echo=FALSE, warning=FALSE}
kable(variable_details[variable_details$variable == "clc_sex", c(1:2)])
```

**3. `typeEnd`** -- variable type of the harmonized variable (`cat` or `cont`).

```{r, echo=FALSE, warning=FALSE}
kable(variable_details[variable_details$variable == "clc_sex", c(1:3)])
```

**4. `databaseStart`** -- CHMS cycles containing this variable, separated by commas.

```{r, echo=FALSE, warning=FALSE}
kable(variable_details[variable_details$variable == "clc_sex", c(1:4)])
```

**5. `variableStart`** -- source variable names in each CHMS cycle. Supports several formats:

| Format | Meaning | Example |
|--------|---------|---------|
| `[variable_name]` | Same name across all cycles | `[clc_sex]` |
| `cycle1::name1, [default_name]` | Cycle-specific exception with a default | `cycle1::amsdmva1, [ammdmva1]` |
| `DerivedVar::[var1, var2, ...]` | Computed by a function from listed inputs | `DerivedVar::[lab_bcre, pgdcgt, clc_sex, clc_age]` |

```{r, echo=FALSE, warning=FALSE}
kable(variable_details[variable_details$variable == "clc_sex", c(1:5)])
```

**6. `typeStart`** -- variable type in the source CHMS data (`cat` or `cont`).

```{r, echo=FALSE, warning=FALSE}
kable(variable_details[variable_details$variable == "clc_sex", c(1:6)])
```

**7. `recEnd`** -- the value to recode each category to. Special values:

- `copy` -- pass through unchanged (for continuous variables)
- `NA::a` -- not applicable
- `NA::b` -- missing
- `Func::function_name` -- derived variable computed by the named function

```{r, echo=FALSE, warning=FALSE}
kable(variable_details[variable_details$variable == "clc_sex", c(1:7)])
```

**8. `numValidCat`** -- number of non-missing categories (categorical only; `N/A` for continuous). Not used by `rec_with_table()`.

```{r, echo=FALSE, warning=FALSE}
kable(variable_details[variable_details$variable == "clc_sex", c(1:8)])
```

**9. `catLabel`** -- short label for the category.

```{r, echo=FALSE, warning=FALSE}
kable(variable_details[variable_details$variable == "clc_sex", c(1:9)])
```

**10. `catLabelLong`** -- detailed label, matching CHMS documentation where possible.

```{r, echo=FALSE, warning=FALSE}
kable(variable_details[variable_details$variable == "clc_sex", c(1:10)])
```

**11. `units`** -- units of the variable, or `N/A`. Must be consistent across all rows of the same variable.

```{r, echo=FALSE, warning=FALSE}
kable(variable_details[variable_details$variable == "clc_sex", c(1:11)])
```

**12. `recStart`** -- the source value or range to match. Uses [interval notation](https://en.wikipedia.org/wiki/Interval_(mathematics)#Notations_for_intervals):

- `[1, 4]` -- all integer values from 1 to 4
- `[1, 2.5]` -- all values from 1 to 2.5 (2.55 would not match)
- `else` -- all values not matched by other rows
- `copy` -- combined with `else`, copies unmatched values unchanged

```{r, echo=FALSE, warning=FALSE}
kable(variable_details[variable_details$variable == "clc_sex", c(1:12)])
```

**13. `catStartLabel`** -- label for the source category, matching CHMS documentation. For missing rows, describes each missing code and its value.

```{r, echo=FALSE, warning=FALSE}
kable(variable_details[variable_details$variable == "clc_sex", c(1:13)])
```

**14. `variableStartShortLabel`** -- short label for the source variable.

```{r, echo=FALSE, warning=FALSE}
kable(variable_details[variable_details$variable == "clc_sex", c(1:14)])
```

**15. `variableStartLabel`** -- detailed label for the source variable, matching CHMS documentation.

```{r, echo=FALSE, warning=FALSE}
kable(variable_details[variable_details$variable == "clc_sex", c(1:15)])
```

**16. `notes`** -- relevant notes about changes between CHMS cycles, missing categories, or variable type changes.

```{r, echo=FALSE, warning=FALSE}
kable(variable_details[variable_details$variable == "clc_sex", c(1:16)])
```

### Derived variables

Derived variables use two special column values:

- **`variableStart`**: `DerivedVar::[var1, var2, var3]` -- lists the input variables
- **`recEnd`**: `Func::function_name` -- names the R function that computes the derived variable

See [Derived variables](derived_variables.html) for details on how derived variables work.

## Next steps

- **See it in action** -- Follow the [Analysis walkthrough](analysis_walkthrough.html) to see how these metadata files drive a real analysis.
- **Understand the methodology** -- For the design rationale behind the rules-as-data approach, see [Methodology](methodology.html).
- **Add your own variables** -- To extend the schema with custom variables, see [How to add variables](how_to_add_variables.html).
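
## Appendix: reading `variableStart` by hand

The three `variableStart` formats described in the column reference can be illustrated with a small base R sketch. The helper `resolve_variable_start()` below is hypothetical and is not part of chmsflow; it only mirrors the documented formats and makes no claim about how `rec_with_table()` resolves names internally. The cycle labels (`cycle1`, `cycle2`, `cycle3`) are assumed to follow the naming used in the examples above.

```r
# Hypothetical helper (not a chmsflow function): return the source variable
# name(s) that a `variableStart` value points to for a given cycle.
resolve_variable_start <- function(variable_start, cycle) {
  # Format 3: DerivedVar::[var1, var2, ...] lists the input variables.
  if (startsWith(variable_start, "DerivedVar::")) {
    inner <- sub("^DerivedVar::\\[(.*)\\]$", "\\1", variable_start)
    return(trimws(strsplit(inner, ",", fixed = TRUE)[[1]]))
  }

  # Formats 1 and 2: split the value into comma-separated entries.
  entries <- trimws(strsplit(variable_start, ",", fixed = TRUE)[[1]])

  # Format 2: a cycle-specific entry looks like "cycle1::name1".
  prefix <- paste0(cycle, "::")
  specific <- entries[startsWith(entries, prefix)]
  if (length(specific) > 0) {
    return(substring(specific[1], nchar(prefix) + 1))
  }

  # Format 1 (or the default in format 2): a bracketed name such as "[clc_sex]".
  default <- entries[grepl("^\\[.*\\]$", entries)]
  if (length(default) > 0) {
    return(gsub("^\\[|\\]$", "", default[1]))
  }

  # No entry applies to this cycle.
  NA_character_
}

resolve_variable_start("[clc_sex]", "cycle3")
# "clc_sex"
resolve_variable_start("cycle1::amsdmva1, [ammdmva1]", "cycle1")
# "amsdmva1"
resolve_variable_start("cycle1::amsdmva1, [ammdmva1]", "cycle2")
# "ammdmva1"
resolve_variable_start("DerivedVar::[lab_bcre, pgdcgt, clc_sex, clc_age]", "cycle1")
# "lab_bcre" "pgdcgt" "clc_sex" "clc_age"
```

The sketch checks the cycle-specific entry before the bracketed default, which matches the "exception with a default" reading of the second format. For derived variables it returns the input names only; the function named in `recEnd` is what actually computes the value.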