---
title: "Getting started with pulso"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Getting started with pulso}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  eval = TRUE
)
```

# Loading GEIH microdata with pulso

`pulso` provides programmatic access to Colombia's Gran Encuesta Integrada de Hogares (GEIH), the household labor force survey published monthly by DANE (Departamento Administrativo Nacional de Estadística).

## Quick start

```{r, eval=FALSE}
library(pulso)

# 2024-06 is a validated period -- loads without any warning
df <- pulso_load(year = 2024, month = 6, module = "ocupados")
```

The result is a tibble with the survey microdata. By default, all columns are returned with their original DANE codes (e.g., P6020, P3271).

## Validated periods and the allow_unvalidated parameter

pulso maintains a registry of periods that have been manually verified against DANE published figures.
As of v0.1.0-rc2, **5 periods are validated**:

- 2007-12
- 2015-06
- 2021-12
- 2022-01
- 2024-06

For all other periods, `pulso_load()` raises a `pulso_data_not_validated` error by default:

```{r, eval=FALSE}
# Raises pulso_data_not_validated -- 2024-09 is not yet validated
df <- pulso_load(year = 2024, month = 9, module = "ocupados")

# Explicitly allow unvalidated periods -- emits a visible warning
df <- pulso_load(year = 2024, month = 9, module = "ocupados",
                 allow_unvalidated = TRUE)
```

To check the validation status of a specific period:

```{r, eval=FALSE}
pulso_validation_status(2024, 6)
```

Or list all validated periods:

```{r, eval=FALSE}
pulso_list_validated_range()
```

## Accessing variable metadata

Pass `metadata = TRUE` to get DANE codebook information attached to the result:

```{r, eval=FALSE}
df <- pulso_load(year = 2024, month = 6, module = "ocupados", metadata = TRUE)
```

You can describe individual columns:

```{r, eval=FALSE}
cat(pulso_describe_column(df, "p6020"))
```

Or list metadata for all columns:

```{r, eval=FALSE}
metadata_summary <- pulso_list_columns_metadata(df)
print(metadata_summary)
```

## Exploring the variable catalog

pulso ships a canonical variable catalog (`variable_map.json`) that maps harmonized variable names to their epoch-specific DANE source codes. These catalog functions work offline -- no data download needed.

List all canonical variables (first 10 rows):

```{r}
library(pulso)
vars <- pulso_list_variables()
head(vars[, c("canonical_name", "module", "has_warning")], 10)
```

Describe a single canonical variable and its epoch mappings:

```{r}
cat(pulso_describe_variable("sexo"))
```

Describe a survey module (reads `sources.json` bundled in the package):

```{r}
cat(pulso_describe("ocupados"))
```

## What is GEIH?

GEIH is Colombia's primary labor market survey, conducted monthly since 2007.
It collects data on:

- Labor force participation (employed, unemployed, inactive)
- Wages and informal employment
- Demographic characteristics (age, sex, education)
- Household composition

Microdata is freely published by DANE in monthly zip files. `pulso` automates the download, parsing, and harmonization across the four design epochs: three GEIH epochs (2007-2018, 2019-2023, 2024-present) plus the historical ECH (2000-2006), which is not yet supported in this release.

## Comparison with the Python package

`pulso` (R) mirrors the API of `pulso-co` (Python). For example:

```python
# Python
import pulso

df = pulso.load(year=2024, month=6, module="ocupados", metadata=True)
print(pulso.describe_column(df, "P6020"))
```

```{r, eval=FALSE}
# R
library(pulso)

df <- pulso_load(year = 2024, month = 6, module = "ocupados", metadata = TRUE)
cat(pulso_describe_column(df, "p6020"))
```

Both packages share the same canonical data files (`sources.json`, `variable_map.json`, `dane_codebook.json`) via the monorepo at https://github.com/Stebandido77/pulso.

## Caching

Downloaded microdata is cached at `tools::R_user_dir("pulso", "cache")` to avoid re-downloading. Pass `cache = FALSE` to force a fresh download.

## Breaking changes in 0.1.0-rc2

If you used `pulso_load()` in earlier development versions, note that **the default behavior has changed for unvalidated periods**:

- **Before:** Loaded silently even if the data was not validated
- **After:** Raises `pulso_data_not_validated` unless `allow_unvalidated = TRUE` is specified

This change aligns the R package with `pulso-co` (Python) and protects users from inadvertently using unvalidated data.
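
Scripts written against earlier development versions may want to handle the new error explicitly rather than fail outright. The following is a minimal sketch of one way to do that with `tryCatch()`. It assumes the error is signaled as an R condition whose class includes `pulso_data_not_validated` (the vignette names the error but does not say how it is signaled). `load_period()` and `load_or_fallback()` are hypothetical helpers written for this illustration, not part of pulso; `load_period()` only imitates the documented default so the example runs without a download.

```{r, eval=FALSE}
# Periods listed as validated in this vignette (v0.1.0-rc2).
validated <- c("2007-12", "2015-06", "2021-12", "2022-01", "2024-06")

# Hypothetical stand-in for pulso_load(); NOT part of the pulso API.
# It mimics the documented default: error on unvalidated periods,
# warning when allow_unvalidated = TRUE.
load_period <- function(year, month, allow_unvalidated = FALSE) {
  period <- sprintf("%04d-%02d", year, month)
  is_validated <- period %in% validated
  if (!is_validated && !allow_unvalidated) {
    # Assumption: the real error carries its name as a condition class.
    stop(structure(
      list(message = paste("Period", period, "is not validated"), call = NULL),
      class = c("pulso_data_not_validated", "error", "condition")
    ))
  }
  if (!is_validated) {
    warning("Loading unvalidated period ", period, call. = FALSE)
  }
  data.frame(period = period, validated = is_validated,
             stringsAsFactors = FALSE)
}

# Try the strict default first; opt in to unvalidated data only when the
# specific error class is caught, and report that it happened.
load_or_fallback <- function(year, month) {
  tryCatch(
    load_period(year, month),
    pulso_data_not_validated = function(e) {
      message("Not validated, retrying with allow_unvalidated = TRUE: ",
              conditionMessage(e))
      load_period(year, month, allow_unvalidated = TRUE)
    }
  )
}

ok <- load_or_fallback(2024, 6)        # validated: loads directly
fallback <- load_or_fallback(2024, 9)  # unvalidated: caught, then retried
```

Matching on the specific class instead of a generic `error` handler keeps unrelated failures, such as network problems, from being silently retried.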
## Coverage and limitations

`pulso` v0.1.0-rc2 supports the following:

- Single year/month/module loads via `pulso_load()`
- Multi-module persona-level merges via `pulso_load_merged()`
- Column metadata via `pulso_describe_column()` and `pulso_list_columns_metadata()`
- Module discovery via `pulso_describe()`
- Canonical variable catalog via `pulso_describe_variable()` and `pulso_list_variables()`
- Validation status queries via `pulso_validation_status()` and `pulso_list_validated_range()`
- Coverage: 2007-01 to present (`sources.json`)

Known limitations:

- Only 5 of 230 periods are validated. Use `allow_unvalidated = TRUE` for the rest, with awareness that results may differ from DANE official tables.
- Curator entries in `variable_map.json` are theoretical mappings pending empirical verification. Use `has_warning` from `pulso_list_variables()` to identify these entries.
- Nested-zip periods (2024-03, 2024-04) are deferred to v0.2.0.
- Mixed-level merges (persona + hogar) in `pulso_load_merged()` are deferred to v0.2.0.
- ECH epoch (2000-2006) not yet supported. Planned for v0.2.0.

See the GitHub issues for roadmap and known limitations.
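
As a rough illustration of how the `has_warning` flag from the known limitations might be used, the sketch below filters a catalog for entries that still await verification. The data frame is a stand-in for the result of `pulso_list_variables()`: the column names are the ones shown in this vignette, but the rows (other than the variable name `sexo` and the module name `ocupados`) are made up, and `has_warning` is assumed to be a logical column.

```{r, eval=FALSE}
# Illustrative stand-in for the result of pulso_list_variables().
# Column names come from this vignette; the rows are invented, and
# has_warning is assumed to be logical.
vars <- data.frame(
  canonical_name = c("sexo", "edad", "ingreso_mensual"),
  module = c("caracteristicas_generales", "caracteristicas_generales",
             "ocupados"),
  has_warning = c(FALSE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# Variables whose mapping is still flagged as pending verification.
needs_review <- vars[vars$has_warning, c("canonical_name", "module")]

# Count of flagged and unflagged variables per module.
flag_counts <- table(module = vars$module, has_warning = vars$has_warning)

print(needs_review)
print(flag_counts)
```

With the real catalog, replacing the stand-in with `vars <- pulso_list_variables()` should give the same kind of summary, provided the column types match the assumption above.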