---
title: "Transpiling STATA do-files to metasurvey Recipes"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Transpiling STATA do-files to metasurvey Recipes}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  eval = FALSE,
  warning = FALSE,
  message = FALSE
)
```

## Motivation

Many Latin American research groups maintain decades of STATA `.do` files that
process household survey microdata. These scripts encode institutional
knowledge about variable harmonization, income decomposition, and indicator
construction -- but they are locked in a format that is hard to version, share,
or integrate with modern R workflows.

The **metasurvey transpiler** converts STATA `.do` files into metasurvey Recipe
objects. This enables:

- **Reproducibility**: STATA pipelines become version-controlled JSON recipes
- **Interoperability**: the same recipe runs on any `data.table`-backed Survey
  object
- **Discovery**: transpiled recipes can be published to the metasurvey API for
  other researchers to find and reuse

The transpiler handles the most common STATA patterns found in survey
processing scripts: variable generation, conditional replacement, recoding,
aggregation, loops, missing-value encoding, and label extraction.

## Quick start

```r
library(metasurvey)

# Transpile a single .do file
result <- transpile_stata("demographics.do")

result$steps[1:3]
#> [1] "step_rename(svy, hh_id = \"id\", person_id = \"nper\")"
#> [2] "step_compute(svy, weight_yr = pesoano)"
#> [3] "step_compute(svy, sex = e26)"
```

The result is a list with four elements:

| Element | Description |
|---|---|
| `steps` | Character vector of metasurvey step calls |
| `labels` | Variable and value labels extracted from the `.do` file |
| `warnings` | Any commands that required manual review |
| `stats` | Counts of translated, skipped, and manual-review commands |

## Transpilation pipeline

The transpiler works in four passes:

```
.do file
    |
    v
[1] parse_do_file()       -- tokenize lines into command objects
    |
    v
[2] translate_commands()  -- map each STATA command to metasurvey steps
    |
    v
[3] optimize_steps()      -- consolidate consecutive renames, drops, etc.
    |
    v
[4] Recipe / JSON         -- bundle steps with metadata
```

### Pass 1: Parsing

`parse_do_file()` reads a `.do` file and produces a list of command objects.
It handles:

- **Comment stripping**: `*`, `//`, and `/* */` block comments
- **Line continuation**: `///` and `/* */` used as continuation markers
- **Loop expansion**: `foreach` and `forvalues` are unrolled, substituting
  backtick macros (`` `var' ``) with each iteration value
- **Prefix handling**: `capture`, `bysort group:`, and command abbreviations
  (`g` for `gen`, `cap` for `capture`)

```r
commands <- parse_do_file("demographics.do")
commands[[1]]
#> $cmd
#> [1] "gen"
#> $args
#> [1] "sex = e26"
#> $if_clause
#> NULL
#> $by_group
#> NULL
#> $capture
#> [1] FALSE
```

### Pass 2: Command translation

Each parsed command is mapped to one or more metasurvey step strings. The
following sections show the supported STATA commands and their translations.

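To make the mapping idea concrete, here is a minimal base-R sketch of how a single parsed `gen` command could be turned into a step string. The helper `translate_gen_sketch()` is hypothetical and is not part of metasurvey; it only assumes the command-object fields shown in the Pass 1 output (`cmd`, `args`, `if_clause`) and handles nothing beyond plain `gen`.

```r
# Hypothetical helper for illustration only; not the package's translator.
# Input: one command object shaped like an element of parse_do_file() output.
translate_gen_sketch <- function(command) {
  stopifnot(identical(command$cmd, "gen"))

  # Split "name = expression" at the FIRST "=" only, so that operators
  # such as "==" or ">=" inside the expression are left untouched.
  first_eq <- regexpr("=", command$args, fixed = TRUE)
  parts <- regmatches(command$args, first_eq, invert = TRUE)[[1]]
  target <- trimws(parts[1])
  expr <- trimws(parts[2])

  # An `if` clause becomes a fifelse() that leaves other rows missing.
  if (!is.null(command$if_clause)) {
    expr <- sprintf("data.table::fifelse(%s, %s, NA)", command$if_clause, expr)
  }

  sprintf("step_compute(svy, %s = %s)", target, expr)
}

plain <- list(cmd = "gen", args = "sex = e26", if_clause = NULL)
conditional <- list(cmd = "gen", args = "employed = hours_worked",
                    if_clause = "age >= 14")

translate_gen_sketch(plain)
translate_gen_sketch(conditional)
```

The sketch returns strings rather than evaluating anything, which mirrors the fact that `result$steps` is a character vector of step calls.
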
## Supported patterns

### gen / generate

Simple variable creation translates to `step_compute`:

```stata
gen sex = q01
gen is_urban = (region < 3)
gen byte age_group = -9
```

```r
step_compute(svy, sex = q01)
step_compute(svy, is_urban = (region < 3))
step_compute(svy, age_group = -9L)
```

When a `gen` includes an `if` clause, the condition is wrapped in `fifelse`:

```stata
gen employed = hours_worked if age >= 14
```

```r
step_compute(svy, employed = data.table::fifelse(age >= 14, hours_worked, NA))
```

### gen + replace chains (the dominant pattern)

The most common pattern in survey `.do` files is initializing a variable and
then filling it with conditional replacements:

```stata
gen relationship = -9
replace relationship = 1 if q05 == 1
replace relationship = 2 if q05 == 2
replace relationship = 3 if inrange(q05, 3, 5)
replace relationship = 4 if q05 == 6
```

When **all** right-hand sides are constants, the transpiler emits a single
`step_recode`:

```r
step_recode(svy, relationship,
  q05 == 1 ~ 1L,
  q05 == 2 ~ 2L,
  q05 >= 3 & q05 <= 5 ~ 3L,
  q05 == 6 ~ 4L,
  .default = -9L)
```

When any right-hand side is an **expression**, the transpiler emits a chain of
`step_compute` with `fifelse`:

```stata
gen labour_inc = 0
replace labour_inc = wage if job_type == 1
replace labour_inc = wage + bonus if job_type == 2
```

```r
step_compute(svy, labour_inc = 0L)
step_compute(svy, labour_inc = data.table::fifelse(
  job_type == 1, wage, labour_inc))
step_compute(svy, labour_inc = data.table::fifelse(
  job_type == 2, wage + bonus, labour_inc))
```

### recode

STATA `recode` with parenthesized mappings or inline syntax:

```stata
recode urban_filter (0=2)
recode edu_level (2=2) (3=-9) (4=3) (5=4), gen(edu_compat)
recode var1 var2 var3 .=0
```

```r
step_compute(svy, urban_filter = data.table::fifelse(
  urban_filter == 0, 2, urban_filter))

step_compute(svy, edu_compat = edu_level)
step_compute(svy, edu_compat = data.table::fifelse(
  edu_compat == 2, 2, edu_compat))
# ... one fifelse per mapping

# Multi-variable recode: one step per variable
step_compute(svy, var1 = data.table::fifelse(is.na(var1), 0, var1))
step_compute(svy, var2 = data.table::fifelse(is.na(var2), 0, var2))
step_compute(svy, var3 = data.table::fifelse(is.na(var3), 0, var3))
```

### egen (aggregation with by-groups)

```stata
bysort household: egen hh_income = sum(income)
egen max_age = max(age), by(household)
```

```r
step_compute(svy, hh_income = sum(income, na.rm = TRUE), .by = "household")
step_compute(svy, max_age = max(age, na.rm = TRUE), .by = "household")
```

Supported `egen` functions: `sum`, `max`, `min`, `mean`, `count`, `sd`,
`median`, `total`, `rowtotal`, `rowmean`.

### foreach loops

Loops are expanded during parsing. The transpiler unrolls `foreach` with both
`in` lists and `of numlist` ranges, including nested loops:

```stata
foreach i of numlist 1/4 {
    gen contrib`i' = 0
    replace contrib`i' = amount if provider == `i'
}
```

Expands to 4 pairs of gen+replace, each transpiled independently:

```r
step_recode(svy, contrib1, provider == 1 ~ amount, .default = 0L)
step_recode(svy, contrib2, provider == 2 ~ amount, .default = 0L)
step_recode(svy, contrib3, provider == 3 ~ amount, .default = 0L)
step_recode(svy, contrib4, provider == 4 ~ amount, .default = 0L)
```

### mvencode (missing value encoding)

```stata
mvencode income_1 income_2 income_3, mv(0)
```

```r
step_compute(svy, income_1 = data.table::fifelse(
  is.na(income_1), 0, income_1))
step_compute(svy, income_2 = data.table::fifelse(
  is.na(income_2), 0, income_2))
step_compute(svy, income_3 = data.table::fifelse(
  is.na(income_3), 0, income_3))
```

### destring

```stata
destring wage bonus, replace force
```

```r
step_compute(svy, wage = suppressWarnings(
  as.numeric(as.character(wage))))
step_compute(svy, bonus = suppressWarnings(
  as.numeric(as.character(bonus))))
```

### rename, drop, keep

```stata
rename id hh_id
drop aux1 aux2 aux3
```

```r
step_rename(svy, hh_id = "id")
step_remove(svy, aux1, aux2, aux3)
```

Consecutive renames are consolidated into a single `step_rename` call, and
consecutive drops are merged into one `step_remove`.

### STATA expression translation

STATA-specific syntax in expressions is automatically translated:

| STATA | R (data.table) |
|---|---|
| `inrange(x, a, b)` | `(x >= a & x <= b)` |
| `inlist(x, 1, 2, 3)` | `(x %in% c(1, 2, 3))` |
| `var == .` | `is.na(var)` |
| `var != .` | `!is.na(var)` |
| `.` (as value) | `NA` |
| `string(var)` | `as.character(var)` |
| `var[_n-1]` | `data.table::shift(var, 1, type = "lag")` |
| `var[_n+1]` | `data.table::shift(var, 1, type = "lead")` |
| `_N` | `.N` |

### Variable ranges

STATA allows variable ranges like `aux1-aux4` meaning `aux1 aux2 aux3 aux4`.
The transpiler expands these in `drop`, `recode`, and `mvencode` commands:

```stata
drop contrib1-contrib4
```

```r
step_remove(svy, contrib1, contrib2, contrib3, contrib4)
```

### Labels

Variable and value labels are extracted and stored in the recipe metadata:

```stata
lab var sex "Sex of respondent"
lab def sex_lbl 1 "Male" 2 "Female"
lab val sex sex_lbl
```

```r
result$labels
#> $var_labels
#> $var_labels$sex
#> [1] "Sex of respondent"
#>
#> $val_labels
#> $val_labels$sex
#> $val_labels$sex$`1`
#> [1] "Male"
#> $val_labels$sex$`2`
#> [1] "Female"
```

## Skipped commands

Commands that do not modify survey data are silently skipped during
transpilation. These include:

- I/O: `use`, `save`, `import`, `export`, `insheet`, `outsheet`
- Display: `tabulate`, `summarize`, `describe`, `list`, `browse`, `display`
- Control flow: `if`/`else`, `while`, `program`, `exit`
- Settings: `set`, `sort`, `order`, `compress`, `format`
- Macros: `global`, `local`, `scalar`, `matrix`

The `$stats` element of the result reports how many commands fell into each
category.

## A realistic example

The following `.do` file is a simplified version of a typical survey
demographics module.
It is not a real production script, but it uses the same patterns found in
actual ECH processing pipelines.

Save this as `demo_module.do`:

```stata
* ──────────────────────────────────────────────
* Demographics module -- simplified example
* ──────────────────────────────────────────────

rename id hh_id
rename nper person_id

gen weight_yr = pesoano
gen weight_qt = pesotri

* ── Sex ──
gen sex = q01

* ── Relationship to head ──
g relationship = -9
replace relationship = 1 if q05 == 1
replace relationship = 2 if q05 == 2
replace relationship = 3 if inrange(q05, 3, 5)
replace relationship = 4 if q05 == 6
replace relationship = 5 if q05 == 7

* ── Area ──
gen area = .
replace area = 1 if region == 1
replace area = 2 if region == 2
replace area = 3 if region == 3

* ── Education level (harmonized) ──
recode q20 (2=2) (3=-9) (4=3) (5=4), gen(edu_compat)

* ── Household-level age stats ──
bysort hh_id: egen max_age = max(edad)
bysort hh_id: egen n_members = count(person_id)

* ── Initialize health insurance contributions ──
foreach i of numlist 1/3 {
    gen contrib`i' = 0
    replace contrib`i' = amount if provider == `i'
}

* ── Encode missing values ──
mvencode contrib1 contrib2 contrib3, mv(0)

* ── Clean up ──
drop region q01 q05 q20

* ── Labels ──
lab var sex "Sex"
lab var relationship "Relationship to household head"
lab def sex_lbl 1 "Male" 2 "Female"
lab val sex sex_lbl
lab def rel_lbl 1 "Head" 2 "Spouse" 3 "Child" 4 "Other relative" 5 "Non-relative"
lab val relationship rel_lbl
```

```{r}
library(metasurvey)

# Write the example do-file to a temp location
# Note: STATA macros use backtick-quote (`var') which we build with paste0
bt <- "`" # backtick
sq <- "'" # single quote

do_lines <- c(
  "rename id hh_id",
  "rename nper person_id",
  "gen weight_yr = pesoano",
  "gen weight_qt = pesotri",
  "gen sex = q01",
  "g relationship = -9",
  "replace relationship = 1 if q05 == 1",
  "replace relationship = 2 if q05 == 2",
  "replace relationship = 3 if inrange(q05, 3, 5)",
  "replace relationship = 4 if q05 == 6",
  "replace relationship = 5 if q05 == 7",
  "gen area = .",
  "replace area = 1 if region == 1",
  "replace area = 2 if region == 2",
  "replace area = 3 if region == 3",
  "recode q20 (2=2) (3=-9) (4=3) (5=4), gen(edu_compat)",
  "bysort hh_id: egen max_age = max(edad)",
  "bysort hh_id: egen n_members = count(person_id)",
  "foreach i of numlist 1/3 {",
  paste0("gen contrib", bt, "i", sq, " = 0"),
  paste0("replace contrib", bt, "i", sq, " = amount if provider == ", bt, "i", sq),
  "}",
  "mvencode contrib1 contrib2 contrib3, mv(0)",
  "drop region q01 q05 q20",
  'lab var sex "Sex"',
  'lab var relationship "Relationship to household head"',
  'lab def sex_lbl 1 "Male" 2 "Female"',
  "lab val sex sex_lbl",
  'lab def rel_lbl 1 "Head" 2 "Spouse" 3 "Child" 4 "Other relative" 5 "Non-relative"',
  "lab val relationship rel_lbl"
)

do_file <- tempfile(fileext = ".do")
writeLines(do_lines, do_file)

result <- transpile_stata(do_file)
```

### Inspecting the output

```{r}
cat("Translated:", result$stats$translated, "\n")
cat("Skipped:   ", result$stats$skipped, "\n")
cat("Manual:    ", result$stats$manual_review, "\n")
```

```{r}
# Print the generated steps
for (s in result$steps) cat(s, "\n")
```

### Labels

```{r}
str(result$labels$var_labels)
str(result$labels$val_labels)
```

### Building a Recipe from transpiled steps

```r
rec <- Recipe$new(
  id = "example_demographics",
  name = "Demographics (transpiled)",
  user = "research_team",
  edition = "2022",
  survey_type = "ech",
  default_engine = "data.table",
  depends_on = character(0),
  description = "Harmonized demographics from STATA transpilation",
  steps = result$steps,
  labels = result$labels
)

# Save as JSON
save_recipe(rec, "demographics_recipe.json")

# Apply to survey data
svy <- survey_empty(type = "ech", edition = "2022") |>
  set_data(my_data) |>
  add_recipe(rec) |>
  bake_recipes()
```

## Module-level transpilation

For projects that organize `.do` files by year and thematic module,
`transpile_stata_module()` processes an entire year directory and groups
the results into separate Recipe objects:

```r
recipes <- transpile_stata_module(
  year_dir = "do_files/2022",
  year = 2022,
  user = "research_team",
  output_dir = "recipes/"
)

names(recipes)
#> [1] "data_prep"        "demographics"     "income_detail"
#> [4] "income_aggregate" "cleanup"

# Each recipe has inter-module dependencies
recipes$income_detail$depends_on_recipes
#> [1] "ech_2022_data_prep" "ech_2022_demographics"
```

## Coverage analysis

`transpile_coverage()` reports how many commands in a `.do` file (or
directory) can be automatically transpiled:

```r
transpile_coverage("do_files/")
#>                       file total translated skipped manual coverage
#> 1   2022/2_correc_datos.do    82         60      22      0   100.00
#> 2 2022/3_compatibiliz...do   420        380      40      0   100.00
#> 3 2022/4_ingreso_ht11...do   310        270      40      0   100.00
```

The `coverage` column reports the percentage of **data-transforming** commands
that were translated; skipped non-data commands are excluded from the
denominator. A value below 100% means some commands need manual review -- look
at the `$warnings` element for details.

## Limitations

The transpiler does **not** handle:

- `merge` commands (these depend on external files and are translated as
  comments with `# MANUAL_REVIEW`)
- `collapse` / `reshape` (structural data transformations)
- Custom `program` definitions
- `mata` blocks or `plugin` calls
- Nested `/* */` block comments that contain `/* */` line continuations
  internally (rare; only seen in commented-out legacy code)

Commands that fall outside the transpiler's scope are flagged with
`# MANUAL_REVIEW` in the output and counted in `$stats$manual_review`.

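As a small illustration of working with flagged output, the sketch below filters a step vector for `# MANUAL_REVIEW` markers and reproduces the coverage arithmetic described in the coverage analysis. It runs on plain base R with an invented `steps` vector: the exact text that follows the `# MANUAL_REVIEW` marker is an assumption, and `coverage_sketch()` is a hypothetical helper, not a metasurvey function.

```r
# Invented example data shaped like result$steps (a character vector).
# The wording after "# MANUAL_REVIEW" is assumed for illustration.
steps <- c(
  "step_rename(svy, hh_id = \"id\")",
  "# MANUAL_REVIEW: merge 1:1 hh_id using extra.dta",
  "step_compute(svy, sex = q01)"
)

# Locate the entries that still need a human decision
needs_review <- grepl("MANUAL_REVIEW", steps, fixed = TRUE)
steps[needs_review]

# Coverage as described in the text: translated commands as a share of
# data-transforming commands, i.e. total minus skipped.
coverage_sketch <- function(total, translated, skipped) {
  100 * translated / (total - skipped)
}

# First row of the coverage table above: 60 of (82 - 22) commands
coverage_sketch(total = 82, translated = 60, skipped = 22)
#> [1] 100
```

With a real result, the same `grepl()` call on `result$steps` gives a quick checklist of commands to port by hand.
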
## Summary

| Feature | Status |
|---|---|
| gen / generate | Fully supported |
| replace (conditional) | Fully supported |
| gen + replace chains | Auto-grouped into step_recode or step_compute |
| recode (single & multi-var) | Fully supported |
| egen with by-groups | Fully supported |
| foreach / forvalues | Expanded during parsing |
| mvencode | Fully supported |
| destring / tostring | Fully supported |
| rename / drop / keep | Fully supported |
| Variable & value labels | Extracted to recipe metadata |
| STATA expressions | inrange, inlist, missing, lag/lead, \_N |
| Variable ranges | Expanded (e.g., var1-var4) |
| Nested loops | Recursive expansion |
| Line continuation (///) | Joined during parsing |
| capture prefix | Handled (errors suppressed) |
| bysort prefix | Converted to .by parameter |