---
title: "Building an ECH Demographics Recipe"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Building an ECH Demographics Recipe}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  eval = FALSE,
  warning = FALSE,
  message = FALSE
)
```

## Two paths: transpile fast or build right

metasurvey offers two ways to create recipes from existing STATA code:

1. **Transpile automatically** with `transpile_stata()` -- converts `.do` files to recipes in seconds. Great for migrating legacy code quickly (see `vignette("stata-transpiler")`).
2. **Build from scratch** in R -- more work upfront, but the result is cleaner, uses proper R idioms, and you control every detail.

The transpiler is a pragmatic shortcut: it reads hundreds of lines of STATA and produces a working recipe, but the output inherits the original code's structure -- long `gen`/`replace` chains become long `step_recode` calls, temporary variables survive, and STATA-specific patterns (like `mvencode`) get translated literally rather than rethought.

A hand-crafted recipe, on the other hand, lets you **redesign the logic** in R from the ground up. You pick meaningful variable names, combine related transformations into a single step, and skip intermediate variables that only existed because STATA needed them. The result is shorter, easier to read, and easier to maintain.

This vignette builds a demographics recipe from scratch in about 20 lines of R. A transpiled version of the same pipeline would take 80+ steps and carry over variable names like `bc_pe2` and `bc_pe3` that mean nothing outside the original `.do` file.

## Setting up the survey

We start with an empty Survey object. This declares the survey type and edition without loading any data yet -- the recipe will work on whatever data we feed it later.

```{r}
library(metasurvey)

svy <- survey_empty(type = "ech", edition = "2023")
svy
```

Now let's attach some sample data. In production this would come from `anda_download_microdata("2023")` or a local file; here we simulate it.

```{r}
set.seed(42)
n <- 200

dt <- data.table::data.table(
  id = rep(1:50, each = 4),
  nper = rep(1:4, 50),
  pesoano = runif(n, 50, 300),
  e26 = sample(1:2, n, replace = TRUE),
  e27 = sample(0:90, n, replace = TRUE),
  e30 = sample(1:7, n, replace = TRUE),
  e51_2 = sample(c(0:6, -9), n, replace = TRUE),
  region_4 = sample(1:4, n, replace = TRUE)
)

svy <- svy |> set_data(dt)
```

## Building the pipeline

Every transformation is a **step**. By default, steps are **lazy**: they record what to do without executing it. This lets you inspect and modify the pipeline before materializing the results.

Compare this with the transpiler approach: `transpile_stata()` would produce one step per STATA command, faithfully preserving every `gen` and `replace`. Here we think in terms of the *output variables* we want, not the commands we need to type.

### Rename identifiers

```{r}
svy <- svy |>
  step_rename(
    hh_id = "id",
    person_id = "nper",
    comment = "Standardize identifiers"
  )
```

Nothing happened to the data yet:

```{r}
names(get_data(svy))[1:4]
```

The original column names are still there because the step is pending. Let's keep adding steps.

### Recode sex

In STATA this would be a `gen` + `replace` + `replace` sequence (3 commands). With `step_recode` it's a single, declarative mapping that produces human-readable labels:

```{r}
svy <- svy |>
  step_recode(sex,
    e26 == 1 ~ "Male",
    e26 == 2 ~ "Female",
    .default = NA_character_,
    comment = "Sex from e26"
  )
```

### Age groups

The STATA equivalent uses five `replace` lines with `inrange()`.
Here we write the same logic as a single recode with readable conditions:

```{r}
svy <- svy |>
  step_recode(age_group,
    e27 >= 0 & e27 <= 13 ~ "Child",
    e27 >= 14 & e27 <= 17 ~ "Adolescent",
    e27 >= 18 & e27 <= 29 ~ "Young adult",
    e27 >= 30 & e27 <= 64 ~ "Adult",
    e27 >= 65 ~ "Elderly",
    .default = NA_character_,
    comment = "Age groups from e27"
  )
```

### Relationship to head of household

```{r}
svy <- svy |>
  step_recode(relationship,
    e30 == 1 ~ "Head",
    e30 == 2 ~ "Spouse",
    e30 >= 3 & e30 <= 5 ~ "Child",
    e30 == 6 ~ "Other relative",
    e30 == 7 ~ "Non-relative",
    .default = "Unknown",
    comment = "Relationship from e30"
  )
```

### Education level

```{r}
svy <- svy |>
  step_recode(edu_level,
    e51_2 == 0 ~ "None",
    e51_2 >= 1 & e51_2 <= 2 ~ "Primary",
    e51_2 >= 3 & e51_2 <= 4 ~ "Secondary",
    e51_2 >= 5 & e51_2 <= 6 ~ "Tertiary",
    .default = NA_character_,
    comment = "Education level from e51_2"
  )
```

### Geographic area

```{r}
svy <- svy |>
  step_recode(area,
    region_4 == 1 ~ "Montevideo",
    region_4 == 2 ~ "Urban >5k",
    region_4 == 3 ~ "Urban <5k",
    region_4 == 4 ~ "Rural",
    .default = NA_character_,
    comment = "Geographic area from region_4"
  )
```

Notice that all our output variables have **meaningful labels** instead of numeric codes. A transpiled recipe would keep the original integer codes (1, 2, 3...) because that's what the STATA code used. Building from scratch lets you choose the representation that makes analysis easier.

### Remove raw variables

```{r}
svy <- svy |>
  step_remove(e26, e27, e30, e51_2, region_4,
    comment = "Drop raw ECH variables"
  )
```

## Inspecting the pipeline before execution

At this point we have seven pending steps. Let's see what was recorded:

```{r}
length(get_steps(svy))
```

This is one of the key advantages of building from scratch: **7 steps that each do one clear thing**. A transpiled version of the full IECON demographics module has 80+ steps because it preserves every intermediate STATA command.
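
To make the idea of pending steps concrete, here is a conceptual sketch in base R. It is not how metasurvey is implemented; the `pending` list and the toy data frame are made up for illustration. Steps are stored as plain functions, and the data stays untouched until they are applied in order.

```{r}
# Conceptual sketch only: NOT metasurvey's implementation.
# A lazy pipeline modelled as a named list of recorded functions.
df <- data.frame(e26 = c(1, 2, 1), e27 = c(10, 40, 70))

# Recording steps: nothing touches `df` here.
pending <- list(
  sex = function(d) {
    d$sex <- ifelse(d$e26 == 1, "Male", "Female")
    d
  },
  age_group = function(d) {
    d$age_group <- ifelse(d$e27 >= 65, "Elderly", "Under 65")
    d
  }
)

names(df)       # still only the raw columns
length(pending) # number of recorded steps

# "Baking": apply every recorded step in order, starting from `df`.
baked <- Reduce(function(d, step) step(d), pending, df)
names(baked)
```

Keeping the steps as data is what makes it possible to count, list, or reorder them before anything runs.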

The pipeline is a DAG (directed acyclic graph) of transformations. `view_graph()` renders it as an interactive network -- each node is a step, and edges show variable dependencies:

```{r eval = FALSE}
view_graph(svy)
```

The interactive DAG is not rendered in this vignette to keep the package size small. Run `view_graph()` in your R session to explore it.

With only 7 nodes the graph is clean and navigable. Compare that with a transpiled recipe where the DAG can have 100+ nodes -- still useful for auditing, but much harder to read at a glance.

For static output we can inspect the step list:

```{r}
for (s in get_steps(svy)) {
  cat(sprintf("[%s] %s\n", s$type, s$comment %||% ""))
}
```

## Packaging as a recipe (before baking)

A recipe bundles the steps with metadata so anyone can reproduce the same pipeline on different data. We create the recipe **before** baking -- the lazy steps are the pipeline:

```{r}
rec <- steps_to_recipe(
  name = "ECH Demographics (minimal)",
  user = "research_team",
  svy = svy,
  steps = get_steps(svy),
  description = paste(
    "Harmonized demographics: sex, age group, relationship,",
    "education level, and geographic area."
  ),
  topic = "demographics"
)
rec
```

The recipe auto-generates documentation from the steps:

```{r}
doc <- rec$doc()
cat("Input variables: ", paste(doc$input_variables, collapse = ", "), "\n")
cat("Output variables:", paste(doc$output_variables, collapse = ", "), "\n")
cat("Pipeline steps: ", length(doc$pipeline), "\n")
```

## Baking: materializing the pipeline

Now let's execute the steps.
`bake_steps()` runs all pending steps in order and returns the transformed survey:

```{r}
svy <- bake_steps(svy)
```

The data has the new columns with readable labels:

```{r}
head(get_data(svy)[, .(
  hh_id, person_id, sex, age_group,
  relationship, edu_level, area
)])
```

The raw variables are gone:

```{r}
"e26" %in% names(get_data(svy))
```

## Saving and loading

Recipes serialize to JSON for version control and sharing:

```{r}
f <- tempfile(fileext = ".json")
save_recipe(rec, f)
```

```{r}
rec2 <- read_recipe(f)
rec2$name
length(rec2$steps)
```

The JSON is human-readable and diffable in git:

```{r}
cat(readLines(f, n = 15), sep = "\n")
```

## Applying to a new edition

The same recipe works on any edition. Load the recipe from JSON, attach it to new data, and bake:

```{r}
rec_loaded <- read_recipe(f)

svy_2024 <- survey_empty(type = "ech", edition = "2024") |>
  set_data(data.table::data.table(
    id = rep(1:30, each = 3),
    nper = rep(1:3, 30),
    pesoano = runif(90, 50, 300),
    e26 = sample(1:2, 90, replace = TRUE),
    e27 = sample(0:90, 90, replace = TRUE),
    e30 = sample(1:7, 90, replace = TRUE),
    e51_2 = sample(c(0:6, -9), 90, replace = TRUE),
    region_4 = sample(1:4, 90, replace = TRUE)
  )) |>
  add_recipe(rec_loaded) |>
  bake_recipes()

head(get_data(svy_2024)[, .(hh_id, person_id, sex, age_group, area)])
```

No code changes needed. The recipe encodes the *logic*, not the data.
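
Because the recipe only encodes logic, it can only succeed when the new edition actually contains the raw columns the steps read. A minimal pre-flight check in base R might look like the sketch below. The helper `missing_inputs()` is hypothetical and not part of metasurvey; the column list is simply the set of raw variables used in this vignette.

```{r}
# Hypothetical helper, not a metasurvey function: report which required
# raw columns are absent from a new edition's data.
missing_inputs <- function(data, required) {
  setdiff(required, names(data))
}

# Raw variables read by the steps in this vignette.
required_vars <- c("id", "nper", "e26", "e27", "e30", "e51_2", "region_4")

# Toy "new edition" that lacks e51_2.
new_edition <- data.frame(
  id = 1:2, nper = c(1, 1), e26 = c(1, 2), e27 = c(34, 71),
  e30 = c(1, 1), region_4 = c(1, 4)
)

absent <- missing_inputs(new_edition, required_vars)
if (length(absent) > 0) {
  message("Cannot bake yet, missing columns: ", paste(absent, collapse = ", "))
}
```

In practice the required columns could plausibly be taken from `doc$input_variables` shown earlier, assuming that field holds the raw column names.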

## Transpiler vs hand-crafted: when to use each

| | Transpiler (`transpile_stata()`) | Hand-crafted recipe |
|---|---|---|
| **Speed** | Seconds -- instant migration | Hours -- requires understanding the logic |
| **Steps** | 80-200 per module (one per STATA line) | 5-20 (one per concept) |
| **Variable names** | Inherits STATA names (`bc_pe2`, `bc_pe3`) | Your own names (`sex`, `age_group`) |
| **Labels** | Numeric codes (`1`, `2`, `3`) | Readable labels (`"Male"`, `"Female"`) |
| **Readability** | Faithful to original, verbose | Clean, self-documenting |
| **Maintenance** | Hard to modify individual steps | Easy to change any mapping |
| **DAG visualization** | Large, hard to read | Compact, meaningful nodes |
| **Best for** | Migrating legacy code fast | New projects, critical pipelines |

**Recommended workflow**: use `transpile_stata()` to migrate your existing `.do` files immediately so you have a working baseline. Then gradually replace transpiled recipes with hand-crafted ones as you review each module. The transpiled version keeps you running; the hand-crafted version is where you want to end up.
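
When replacing a transpiled recipe with a hand-crafted one, it helps to confirm that the new labels agree with the old numeric codes. The sketch below uses base R only. The column names `legacy_code` and `label` and the toy data are hypothetical stand-ins for a transpiled output variable and its hand-crafted counterpart.

```{r}
# Toy data: one row per person, old numeric code next to the new label.
# Both column names are hypothetical.
compare <- data.frame(
  legacy_code = c(1, 2, 2, 1, 2, 1),
  label = c("Male", "Female", "Female", "Male", "Female", "Male")
)

# Cross-tabulate the two representations.
xtab <- table(compare$legacy_code, compare$label)
xtab

# The mapping is consistent if every legacy code lands in exactly one label
# and every label comes from exactly one code.
consistent <- all(rowSums(xtab > 0) == 1) && all(colSums(xtab > 0) == 1)
consistent
```

A cross-tabulation with off-diagonal counts would point to a category that the two versions treat differently and that deserves a closer look.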

## What metasurvey gives you

| Manual STATA scripts | metasurvey recipe |
|---|---|
| Copy-paste `.do` files per year | One recipe, any edition |
| Undocumented variable names | Auto-generated input/output docs |
| No dependency tracking | DAG visualization with `view_graph()` |
| Flat scripts, no validation | `validate()` checks required variables |
| Email `.do` files to colleagues | `publish_recipe()` to shared registry |
| Re-run entire script to test | Lazy steps: inspect before baking |

## Next steps

- **Add more variables**: income, labor force status, housing conditions -- each can be a separate recipe with `depends_on_recipes`
- **Certify your recipe**: use `certify_recipe()` to mark it as reviewed or official
- **Publish**: `publish_recipe(rec)` uploads to the shared registry where others can find it with `search_recipes(topic = "demographics")`
- **Start from STATA**: if you already have `.do` files, use `transpile_stata()` to generate a working baseline immediately -- see `vignette("stata-transpiler")` -- then refine the output into a hand-crafted recipe like the one in this vignette