---
title: "Preparing STEPS Data for Analysis"
author: "Abhijit Pakhare"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Preparing STEPS Data for Analysis}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  eval = FALSE
)
```

## Introduction

This guide helps you prepare your WHO STEPS survey data file for use with the `stepssurvey` package. It covers the variables the package expects, how auto-detection works, common data quality issues, and how to resolve mismatches between your data and the package expectations.

The package is designed to work with STEPS data from any country, regardless of instrument version (v3.1 or v3.2) or data management system (Epi Info, SPSS, Stata, Excel).

## Supported file formats

| Format | Extension | Typical source | Reader used |
|--------|-----------|----------------|-------------|
| SPSS | `.sav` | WHO STEPS data entry / Epi Info export | `haven::read_spss()` |
| Stata | `.dta` | WHO analysis template | `haven::read_dta()` |
| Excel | `.xlsx` | Custom data entry | `readxl::read_excel()` |
| CSV | `.csv` | Any spreadsheet export | `readr::read_csv()` |

**Recommendation**: Use the `.sav` file directly as exported from your data management system. The package preserves SPSS variable labels during column detection (to disambiguate codes like A1 that mean different things across versions) and then strips them before analysis to avoid compatibility issues.

## Minimum required variables

At a minimum, the package needs **age** and **sex** to produce any output. Beyond that, each additional variable you provide enables more indicators and tables.

### Essential (required)

| Variable | STEPS codes | Description |
|----------|-------------|-------------|
| Age | `C3` (v3.2), `age`, `c1` (v3.1) | Respondent age in completed years |
| Sex | `C1` (v3.2), `sex`, `gender`, `c2` (v3.1) | Male/Female coding (1/2, M/F, or text) |

### Strongly recommended

| Variable | STEPS codes | Description |
|----------|-------------|-------------|
| Sampling weight (Step 1) | `WStep1`, `wt_final`, `sampleweight` | Probability weight for behavioural module |
| Sampling weight (Step 2) | `WStep2` | Weight for physical measurements |
| Sampling weight (Step 3) | `WStep3` | Weight for biochemical measurements |
| PSU / Cluster | `psu`, `cluster`, `I1`, `ea_id` | Primary sampling unit identifier |
| Stratum | `stratum`, `strata`, `district`, `region` | Stratification variable |

If only one weight column is present, it is used for all three steps. If no weight is found, the package assumes equal weights (simple random sample).

### Step 1: Behavioural risk factors

**Tobacco:**

| Variable | v3.1 code | v3.2 code | Description |
|----------|-----------|-----------|-------------|
| Current smoker | T1 | T1 | Currently smoke tobacco (yes/no) |
| Daily smoker | T2 | T2 | Smoke daily (yes/no) |
| Age started | T3 | T3 | Age of smoking initiation |
| Cigarettes/day | T5a | T5a | Manufactured cigarettes per day |
| Quit attempt | T6 | T6 | Tried to quit in past 12 months |
| Past smoker | T8 | T8 | Ever smoked in the past |
| Smokeless tobacco | T12/T15 | T12 | Current smokeless use |
| Second-hand (home) | T17 | T17 | Exposure to smoke at home |
| Second-hand (work) | T18 | T18 | Exposure to smoke at workplace |

**Alcohol:**

| Variable | v3.1 code | v3.2 code | Description |
|----------|-----------|-----------|-------------|
| Ever consumed | -- | A1 | Lifetime alcohol consumption |
| Past 12 months | A2/A4 | A2 | Consumed in past year |
| Current (30 days) | A1 | A5 | Consumed in past 30 days |
| Occasions (30 days) | A6 | A6 | Number of drinking occasions |
| Drinks per occasion | A7 | A7 | Typical number of drinks |
| Heavy episodic | -- | A9 | Times with 6+ drinks (30 days) |

**Note on A1/A5 ambiguity**: In v3.1, `A1` means "current drinker (past 30 days)". In v3.2, `A1` means "ever consumed alcohol" and `A5` is "past 30 days". The package uses SPSS variable labels to disambiguate when the column code alone is ambiguous. This is one reason why `.sav` files (which carry labels) work better than plain CSV.

**Diet:**

| Variable | v3.1 code | v3.2 code | Description |
|----------|-----------|-----------|-------------|
| Fruit days/week | D1 | D1 | Days eating fruit in typical week |
| Fruit servings/day | D2 | D2 | Servings of fruit on those days |
| Vegetable days/week | D3 | D3 | Days eating vegetables |
| Vegetable servings/day | D4 | D4 | Servings of vegetables on those days |
| Salt at table | D5 | D5 | Frequency of adding salt |
| Processed salt food | D7 | D7 | Frequency of processed salty food |

**Physical Activity (GPAQ):**

| Variable | v3.2 code | Description |
|----------|-----------|-------------|
| Vigorous work (y/n) | P1 | Does vigorous work activity |
| Vigorous work days | P2 | Days per week |
| Vigorous work hours | P3a | Hours per day |
| Vigorous work minutes | P3b | Minutes per day |
| Moderate work (y/n) | P4 | Does moderate work activity |
| Moderate work days | P5 | Days per week |
| Moderate work hours | P6a | Hours per day |
| Moderate work minutes | P6b | Minutes per day |
| Transport (y/n) | P7 | Walks or cycles for transport |
| Transport days | P8 | Days per week |
| Transport hours | P9a | Hours per day |
| Transport minutes | P9b | Minutes per day |
| Vigorous recreation (y/n) | P10 | Does vigorous recreational activity |
| Vigorous recreation days | P11 | Days per week |
| Vigorous recreation hours | P12a | Hours per day |
| Vigorous recreation minutes | P12b | Minutes per day |
| Moderate recreation (y/n) | P13 | Does moderate recreational activity |
| Moderate recreation days | P14 | Days per week |
| Moderate recreation hours | P15a | Hours per day |
| Moderate recreation minutes | P15b | Minutes per day |
| Sedentary hours | P16a | Sitting time, hours per day |
| Sedentary minutes | P16b | Sitting time, minutes per day |

The package computes MET-minutes/week from these raw items using WHO MET multipliers: vigorous activities × 8 MET, moderate and transport activities × 4 MET. The `insufficient_pa` indicator (< 600 MET-minutes/week) is then derived automatically.

If your dataset already has a pre-computed `met_total` variable, the package uses it directly instead of calculating from raw items.

### Step 2: Physical measurements

| Variable | v3.1 code | v3.2 code | Description |
|----------|-----------|-----------|-------------|
| Height (cm) | M1 | M11 | Standing height |
| Weight (kg) | M2 | M12 | Body weight |
| Waist (cm) | M3 | M14 | Waist circumference |
| Hip (cm) | -- | M15 | Hip circumference |
| SBP reading 1 | B1 | M4a | First systolic BP |
| SBP reading 2 | B3 | M5a | Second systolic BP |
| SBP reading 3 | B5 | M6a | Third systolic BP |
| DBP reading 1 | B2 | M4b | First diastolic BP |
| DBP reading 2 | B4 | M5b | Second diastolic BP |
| DBP reading 3 | B6 | M6b | Third diastolic BP |
| BP medication | B7/H3 | M7 | Currently on antihypertensives |
| Heart rate 1 | -- | M16a | First heart rate reading |
| Heart rate 2 | -- | M16b | Second heart rate reading |
| Heart rate 3 | -- | M16c | Third heart rate reading |

Blood pressure: The package averages the last two of three readings (WHO protocol). If only two readings are available, their average is used. If only one reading is available, it is used directly.
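
The blood pressure rule above can be sketched in a few lines of base R. This is only an illustration of the stated logic, not the package's internal code: `mean_last_two()` is a hypothetical helper, the toy data frame is invented for the example, and the column names `m4a`, `m5a`, `m6a` assume v3.2 codes after `janitor::clean_names()`.

```{r bp-mean-sketch}
# Hypothetical helper (NOT part of stepssurvey): average the last two
# non-missing readings; fall back to a single reading; NA if none exist.
mean_last_two <- function(r1, r2, r3) {
  readings <- cbind(r1, r2, r3)
  apply(readings, 1, function(x) {
    x <- x[!is.na(x)]                 # keep available readings, in order
    if (length(x) == 0) return(NA_real_)
    mean(utils::tail(x, 2))           # last two, or the only one
  })
}

# Toy systolic readings (assumed column names, illustrative values)
bp <- data.frame(
  m4a = c(142, 130, 118, NA),
  m5a = c(138, 126, NA, NA),
  m6a = c(136, NA, NA, NA)
)

bp$sbp_mean <- mean_last_two(bp$m4a, bp$m5a, bp$m6a)
bp$sbp_mean
# 137 128 118 NA
```

The same helper would apply to the diastolic columns (`m4b`, `m5b`, `m6b`). How the package treats a missing middle reading is not stated above, so the sketch simply averages whichever two readings remain.
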

### Step 3: Biochemical measurements

| Variable | v3.1 code | v3.2 code | Description |
|----------|-----------|-----------|-------------|
| Fasting glucose | C1 | B5 | Fasting blood glucose (mmol/L) |
| Diabetes meds | C5 | B6/H8 | Currently on diabetes medication |
| Total cholesterol | C6 | B8 | Total cholesterol (mmol/L) |
| Cholesterol meds | C10 | B9/H14 | Currently on cholesterol medication |
| HDL cholesterol | -- | B17 | HDL cholesterol (mmol/L) |
| Triglycerides | -- | B16 | Fasting triglycerides (mmol/L) |

### Health history (H-codes)

| Variable | Code | Description |
|----------|------|-------------|
| BP ever measured | H1 | Ever had BP measured by health worker |
| BP diagnosed | H2a | Ever told by doctor that BP is raised |
| Glucose ever measured | H6 | Ever had blood sugar measured |
| DM diagnosed | H7a | Ever told by doctor that blood sugar is raised |
| Cholesterol ever measured | H12 | Ever had cholesterol measured |
| Cholesterol diagnosed | H13a | Ever told by doctor that cholesterol is raised |
| CVD history | H17 | History of heart attack, angina, or stroke |
| Aspirin use | H18 | Currently taking aspirin regularly |
| Statin use | H19 | Currently taking statins regularly |
| Advised: quit tobacco | H20a | Doctor/health worker advised to quit tobacco |
| Advised: reduce salt | H20b | Advised to reduce salt intake |
| Advised: eat fruit/veg | H20c | Advised to eat more fruit/vegetables |
| Advised: reduce fat | H20d | Advised to reduce dietary fat |
| Advised: more PA | H20e | Advised to increase physical activity |
| Advised: healthy weight | H20f | Advised to maintain healthy body weight |

## How auto-detection works

When you call `detect_steps_columns(data)` (or upload a file in the Shiny app), the package searches for each expected variable using a prioritised alias list.
For example, the fasting glucose variable is searched for as:

```
b5, b5_mmol, c1_mmol, fasting_glucose, glucose_fasting, fbg, fpg
```

The search is case-insensitive and uses the column names after `janitor::clean_names()` has standardised them.

For ambiguous codes (like A1 which means different things in v3.1 vs v3.2), the package also checks the SPSS variable label to disambiguate. This is why `.sav` files produce the most reliable auto-detection.

After detection, you can inspect the mapping:

```{r inspect}
raw <- import_steps_data("my_data.sav")
cols <- detect_steps_columns(raw)

# See all detected columns
str(cols[!sapply(cols, is.null)])

# See what was NOT detected
names(cols[sapply(cols, is.null)])
```

## Common data issues and solutions

### Issue: Sex coded as numeric without labels

Some datasets code sex as 1/2 without clear labels. The package handles this automatically using the WHO STEPS convention: 1 = Male, 2 = Female. If your data uses a different coding, recode before analysis or override the column.

### Issue: Yes/No variables coded inconsistently

STEPS datasets use various codings for binary variables: 1/2 (yes/no), 0/1, "Yes"/"No", "Y"/"N". The `recode_yn()` function handles all of these automatically. It treats 1 = Yes and 2 = No (the WHO convention), as well as 0/1 where 1 = Yes.

### Issue: Glucose values in mg/dL instead of mmol/L

The package expects biochemical values in mmol/L (the WHO standard). If your data uses mg/dL, convert before analysis:

```{r convert}
# Glucose: mg/dL to mmol/L (divide by 18)
raw$b5 <- raw$b5 / 18

# Cholesterol: mg/dL to mmol/L (divide by 38.67)
raw$b8 <- raw$b8 / 38.67
```

### Issue: Multiple datasets for different STEPS steps

Some surveys store Step 1 (interview), Step 2 (measurements), and Step 3 (blood tests) in separate files.
Merge them by respondent ID before importing:

```{r merge}
step1 <- haven::read_spss("step1_interview.sav")
step2 <- haven::read_spss("step2_measurements.sav")
step3 <- haven::read_spss("step3_biochemistry.sav")

# Join on the respondent ID column (here `pid`), keeping every Step 1 row
combined <- dplyr::left_join(step1, step2, by = "pid") |>
  dplyr::left_join(step3, by = "pid")

# Save combined file
haven::write_sav(combined, file.path(tempdir(), "steps_combined.sav"))

# Or import directly
raw <- combined |> janitor::clean_names()
```

### Issue: Missing sampling weights

If your dataset does not include sampling weights, the package will proceed with equal weights (equivalent to assuming a simple random sample). Point estimates are then unweighted sample estimates, which can be biased when selection probabilities differ across respondents, and confidence intervals may not reflect the true survey design.

For proper analysis, you need at minimum one weight variable. The WHO STEPS toolkit computes three step-specific weights (WStep1, WStep2, WStep3) that account for non-response at each step. Contact your survey statistician if weights are not in the data file.

### Issue: Variables detected as wrong type

If a numeric variable is stored as character (common with Epi Info exports), the cleaning step will attempt to convert it. Values that cannot be converted are set to NA with a warning message. Check the console for messages like "NAs introduced by coercion".

## Pre-flight checklist

Before running the analysis, verify these items:

1. **File format**: Preferably `.sav` (preserves variable labels for better auto-detection). CSV works but may require more manual column mapping.

2. **One row per respondent**: The data should be in wide format with one row per survey participant and columns for each variable.

3. **Age and sex present**: These are the only truly required variables. Verify they contain reasonable values (age should be numeric, sex should have exactly two levels).

4. **Weights present**: Check for columns named `WStep1`, `WStep2`, `WStep3`, or similar. If using a single weight, ensure it maps to `weight_step1`.

5. **Biochemical units**: Glucose should be in mmol/L (typical values 3--15), cholesterol in mmol/L (typical values 2--10). Values in hundreds suggest mg/dL units that need conversion.

6. **Blood pressure readings**: Should be in mmHg. Typical SBP range is 80--250, typical DBP range is 40--150. Values outside this range are set to NA during cleaning.

7. **No duplicate respondent IDs**: If the same person appears multiple times, prevalence estimates will be biased.

8. **Consistent coding**: Binary variables should use the same coding scheme throughout (ideally 1 = Yes, 2 = No per WHO convention).

## Quick diagnostic script

Run this after importing to check data quality:

```{r diagnostic}
library(stepssurvey)

raw <- import_steps_data("my_data.sav")
cols <- detect_steps_columns(raw)

# Summary
cat("Rows:", nrow(raw), "\n")
cat("Columns:", ncol(raw), "\n")
cat("Detected:", sum(!sapply(cols, is.null)), "/", length(cols), "\n")

# Check key variables
if (!is.null(cols$age)) {
  cat("\nAge range:", range(raw[[cols$age]], na.rm = TRUE), "\n")
  cat("Age NAs:", sum(is.na(raw[[cols$age]])), "\n")
}

if (!is.null(cols$sex)) {
  cat("\nSex distribution:\n")
  print(table(raw[[cols$sex]], useNA = "ifany"))
}

if (!is.null(cols$weight_step1)) {
  wt <- raw[[cols$weight_step1]]
  cat("\nWeight range:", round(range(wt, na.rm = TRUE), 3), "\n")
  cat("Weight NAs:", sum(is.na(wt)), "\n")
}

# List undetected variables (named `undetected` to avoid masking base::missing)
undetected <- names(cols[sapply(cols, is.null)])
cat("\nUndetected variables (", length(undetected), "):\n")
cat(paste(" ", undetected, collapse = "\n"), "\n")
```

## Further reading

- `vignette("stepssurvey-guide")` -- full API documentation
- `vignette("shiny-walkthrough")` -- interactive Shiny app guide
- [WHO STEPS Manual, Part 4: Data Analysis](https://www.who.int/teams/noncommunicable-diseases/surveillance/systems-tools/steps/manuals)
- [WHO STEPS Instrument v3.2](https://www.who.int/teams/noncommunicable-diseases/surveillance/systems-tools/steps/instrument)