---
title: "Introduction to unicefData"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Introduction to unicefData}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  eval = FALSE
)
```

## Motivation: Data Provenance in the Age of AI

This package is motivated by a fundamental principle: **data acquisition should be treated as code, not as an external preparatory step**.

This is increasingly important in the age of AI-assisted research. As AI tools accelerate analytical workflows while also making it easy to fabricate plausible statistics and citations, anchoring research to **authoritative, version-controlled data sources** becomes essential infrastructure for scientific credibility.

The unicefData package operationalizes this principle by making data provenance an integral component of your analytical script. Consider the following call:

```r
df <- unicefData(
  indicator = "CME_MRY0T4",
  countries = c("ALB", "USA", "BRA"),
  year = "2015:2023"
)
```

Here you are not downloading a spreadsheet from a portal and applying undocumented filters. Instead, you are executing an explicit, reproducible specification of provenance: a command that documents *what data* were requested, *from which source*, and *under what constraints*. This ensures:

- **Traceability**: Others can inspect your code and verify exactly which data were used
- **Transparency**: Data selection decisions are parametrized and visible
- **Sustainability**: If upstream data are revised, the same command yields systematically updated results
- **Reproducibility**: Your analysis remains verifiable and replicable over time

### Design Principles from wbopendata

The unicefData package adopts three design principles from wbopendata (Azevedo, 2026), a similar package for World Bank data:

1. **Data acquisition as code**: Indicator selection, country coverage, and time ranges are explicit parameters in your analytical script
2. **Backward compatibility as trust infrastructure**: The package prioritizes stable syntax so analyses remain reproducible even as APIs evolve
3. **Domain-specific syntax as error prevention**: Rather than exposing HTTP requests and JSON, the interface uses familiar concepts (indicators, countries, years) that constrain input to meaningful values

Together, these principles treat data acquisition as **infrastructure for reproducibility**, not mere convenience.

## The UNICEF Data Warehouse

The United Nations Children's Fund (UNICEF) maintains one of the world's most comprehensive databases on child welfare, covering health, nutrition, education, protection, HIV/AIDS, and water, sanitation and hygiene (WASH). The [UNICEF Data Warehouse](https://sdmx.data.unicef.org/) uses the Statistical Data and Metadata eXchange (SDMX) standard, an international standard (ISO 17369) for exchanging statistical information.

The warehouse currently maintains **733+ indicators** organized across thematic dataflows:

| Dataflow | Domain | Indicators |
|----------|--------|------------|
| CME | Child Mortality Estimates | 39 |
| NUTRITION | Stunting, wasting, underweight | 112 |
| IMMUNISATION | Immunization coverage | 18 |
| WASH_HOUSEHOLDS | Water and sanitation | 57 |
| EDUCATION | Education access and quality | 38 |
| HIV_AIDS | HIV-related indicators | 38 |
| MNCH | Maternal, newborn, child health | varies |
| PT | Child protection | varies |
| CHLD_PVTY | Child poverty | varies |
| GENDER | Gender equality | varies |

## What is SDMX?

SDMX (Statistical Data and Metadata eXchange) is an international standard for structuring and exchanging statistical data. Within SDMX:

- An **indicator** is a time-series measure of a specific phenomenon (e.g., under-5 mortality rate, stunting prevalence, DTP3 coverage).
- Each indicator is identified by a unique code (e.g., `CME_MRY0T4`) and belongs to a **dataflow**, which groups related indicators.
- Observations can be disaggregated along **dimensions** such as sex, age, wealth quintile, residence (urban/rural), and maternal education.
- Each observation also carries **attributes** (data source, confidence intervals, observation status) that contextualize the value.

While powerful, direct interaction with the SDMX API requires knowledge of dataflow structures, dimension codes, and RESTful query syntax. The `unicefData` package removes these barriers.

## Why unicefData?

The `unicefData` package provides a simple R interface to the UNICEF Data Warehouse. It is part of a trilingual ecosystem with matching implementations in R, Python (`unicef_api`), and Stata (`unicefdata`), which share function names and parameter structures to support cross-team collaboration.

Key features:

- **Automatic dataflow detection**: specify only the indicator code; the package finds the correct dataflow automatically.
- **Discovery commands**: search indicators by keyword, list dataflows, and display metadata, with no prior SDMX knowledge required.
- **Flexible filtering**: countries, years, sex, age, wealth quintiles, residence, maternal education.
- **Multiple output formats**: long (default), `wide` (years as columns), `wide_indicators` (indicators as columns).
- **Caching with memoisation**: repeated queries are fast.

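For contrast with the package interface, the sketch below assembles the kind of SDMX REST data query that would otherwise have to be written by hand. It follows the general SDMX 2.1 REST pattern (`data/{flow}/{key}` with `startPeriod`/`endPeriod` parameters). The helper `build_sdmx_query()` is hypothetical and not part of `unicefData`; the endpoint path, the `UNICEF,CME,1.0` flow reference, and the order of dimensions in the key are assumptions made for illustration and should be checked against the warehouse documentation.

```r
# Hypothetical helper (not part of unicefData): assemble an SDMX REST data URL.
build_sdmx_query <- function(base_url, flow_ref, key_parts,
                             start_period = NULL, end_period = NULL) {
  # Codes within one dimension are joined with "+", dimensions with "."
  key <- paste(
    vapply(key_parts, paste, character(1), collapse = "+"),
    collapse = "."
  )
  url <- paste(base_url, "data", flow_ref, key, sep = "/")

  # Optional time window, passed as query parameters
  params <- c(
    if (!is.null(start_period)) paste0("startPeriod=", start_period),
    if (!is.null(end_period)) paste0("endPeriod=", end_period)
  )
  if (length(params) > 0) {
    url <- paste0(url, "?", paste(params, collapse = "&"))
  }
  url
}

query_url <- build_sdmx_query(
  base_url     = "https://sdmx.data.unicef.org/ws/public/sdmxapi/rest", # assumed endpoint
  flow_ref     = "UNICEF,CME,1.0",                                      # assumed agency, flow, version
  key_parts    = list(c("BRA", "IND", "CHN"), "CME_MRY0T4", "_T"),      # assumed dimension order
  start_period = 2015,
  end_period   = 2023
)
print(query_url)
```

With `unicefData()`, the same request is expressed through the `indicator`, `countries`, and `year` arguments, and the dataflow and key layout are resolved by the package.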
## Installation

Install from GitHub:

```{r install}
# install.packages("devtools")
devtools::install_github("unicef-drp/unicefData")
```

## Quick start

```{r library}
library(unicefData)
```

### Discovering indicators

Before downloading data, explore what is available:

```{r discovery}
# Browse indicator categories (thematic dataflows)
list_categories()

# Search for indicators by keyword
search_indicators("mortality")

# List all indicators in the Child Mortality Estimates dataflow
list_indicators("CME")

# Get detailed information about a specific indicator
get_indicator_info("CME_MRY0T4")
```

These discovery commands mirror Examples 1--4 in the package paper. The Stata equivalents are:

```
. unicefdata, categories
. unicefdata, search(mortality)
. unicefdata, indicators(CME)
. unicefdata, info(CME_MRY0T4)
```

### Basic data retrieval

Fetch the under-5 mortality rate for three countries over a range of years:

```{r basic-retrieval}
# Example 5 (paper): Basic data retrieval
df <- unicefData(
  indicator = "CME_MRY0T4",
  countries = c("BRA", "IND", "CHN"),
  year = "2015:2023"
)

head(df)
```

The equivalent Stata command is:

```
. unicefdata, indicator(CME_MRY0T4) countries(BRA IND CHN) year(2015:2023) clear
```

### Geographic filtering

Fetch data for East African countries for a single year:

```{r geographic}
# Example 6 (paper): Geographic filtering
df <- unicefData(
  indicator = "CME_MRY0T4",
  countries = c("KEN", "TZA", "UGA", "ETH", "RWA"),
  year = 2020
)
```

### Latest values and most recent values

```{r latest-mrv}
# Example 7 (paper): Get the latest available value per country
df_latest <- unicefData(
  indicator = "CME_MRY0T4",
  countries = c("BGD", "IND", "PAK"),
  latest = TRUE
)

# Get the 3 most recent values per country
df_mrv <- unicefData(
  indicator = "CME_MRY0T4",
  countries = c("BGD", "IND", "PAK"),
  mrv = 3
)
```

### Year specifications

The `year` parameter supports multiple formats:

```{r year-formats}
# Single year
df <- unicefData(indicator = "CME_MRY0T4", year = 2020)

# Year range
df <- unicefData(indicator = "CME_MRY0T4", year = "2015:2023")

# Non-contiguous years
df <- unicefData(indicator = "CME_MRY0T4", year = "2015,2018,2020")

# Circa mode: find closest available year
df <- unicefData(indicator = "CME_MRY0T4", year = 2015, circa = TRUE)
```

## Disaggregations

UNICEF data support rich disaggregation along multiple dimensions. Not all dimensions are available for all indicators; availability depends on the dataflow (see the disaggregation matrix in the package paper).

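Because availability differs by dataflow, it can be useful to check which disaggregation columns actually vary in a returned data frame before analysing it. The sketch below uses a hypothetical helper, `varying_dimensions()`, and a toy data frame. The column names (`sex`, `residence`, and so on) are assumptions about the shape of the output rather than confirmed package behaviour, and the `value` column holds `NA` placeholders, not real data.

```r
# Hypothetical helper (not part of unicefData): list the disaggregation
# columns that take more than one distinct value in a data frame.
varying_dimensions <- function(df,
                               candidates = c("sex", "age", "wealth_quintile",
                                              "residence", "maternal_edu")) {
  # Keep only the candidate columns that exist in the data frame
  present <- intersect(candidates, names(df))
  # Count distinct codes per column
  n_levels <- vapply(df[present], function(x) length(unique(x)), integer(1))
  present[n_levels > 1]
}

# Toy structure only: codes are illustrative and values are placeholders
toy <- data.frame(
  iso3      = c("IND", "IND", "IND"),
  sex       = c("_T", "F", "M"),
  residence = c("_T", "_T", "_T"),
  value     = c(NA_real_, NA_real_, NA_real_),
  stringsAsFactors = FALSE
)

print(varying_dimensions(toy))
```

A column that is present but constant (here `residence`, which only holds the total code) is not reported, which helps distinguish a genuinely disaggregated result from one that only carries totals.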
### By sex

```{r sex}
# Total only (default)
df <- unicefData(indicator = "CME_MRY0T4", sex = "_T")

# Female only
df <- unicefData(indicator = "CME_MRY0T4", sex = "F")

# All sex categories (total, male, female)
df <- unicefData(indicator = "CME_MRY0T4", sex = "ALL")
```

### By wealth quintile

```{r wealth}
# Example 8 (paper): Stunting by wealth and sex
# NT_ANT_HAZ_NE2 is the height-for-age (stunting) indicator
df <- unicefData(
  indicator = "NT_ANT_HAZ_NE2",
  countries = "IND",
  sex = "ALL",
  wealth = "ALL"
)
```

### By residence

```{r residence}
# Urban only
df <- unicefData(indicator = "NT_ANT_HAZ_NE2", residence = "U")

# Rural only
df <- unicefData(indicator = "NT_ANT_HAZ_NE2", residence = "R")
```

## Output formats

### Wide format (years as columns)

Useful for time-series analysis:

```{r wide}
# Example 9 (paper): Wide format
df_wide <- unicefData(
  indicator = "CME_MRY0T4",
  countries = c("USA", "GBR", "DEU", "FRA"),
  year = "2000,2010,2020,2023",
  format = "wide"
)
```

### Multiple indicators

Fetch and merge multiple indicators automatically:

```{r multi-indicator}
# Example 10 (paper): Multiple indicators
df <- unicefData(
  indicator = c("CME_MRM0", "CME_MRY0T4"),
  countries = c("KEN", "TZA", "UGA"),
  year = 2020
)

# Wide indicators format: one column per indicator
df_wide <- unicefData(
  indicator = c("CME_MRY0T4", "CME_MRY0", "IM_DTP3", "IM_MCV1"),
  countries = c("AFG", "ETH", "PAK", "NGA"),
  latest = TRUE,
  format = "wide_indicators"
)
```

## Metadata enrichment

Add regional and income group classifications:

```{r metadata}
# Example 12 (paper): Regional classifications
df <- unicefData(
  indicator = "CME_MRY0T4",
  add_metadata = c("region", "income_group"),
  latest = TRUE
)
```

## Data cleaning and filtering

Post-processing utilities for downloaded data:

```{r clean-filter}
# Clean raw SDMX column names to user-friendly names
df_raw <- unicefData_raw(indicator = "CME_MRY0T4", countries = "BRA")
df_clean <- clean_unicef_data(df_raw)

# Filter to specific disaggregations
df_filtered <- filter_unicef_data(df_clean, sex = "F", wealth = "Q1")
```

## Cache management

The package caches metadata and API responses for performance. To clear, refresh, or inspect the caches:

```{r cache}
# Clear all caches and reload metadata
clear_unicef_cache()

# Clear without reloading (lazy reload on next use)
clear_unicef_cache(reload = FALSE)

# View cache status
get_cache_info()
```

## Dataflow schemas

Inspect the structure of any dataflow:

```{r schema}
# View the dimensions and attributes of a dataflow
schema <- dataflow_schema("CME")
print(schema)
```

## Cross-language parity

The `unicefData` ecosystem provides matching functionality across R, Python, and Stata. The same analytical workflow translates directly:

| Operation | R | Python | Stata |
|-----------|---|--------|-------|
| Search | `search_indicators("mortality")` | `search_indicators("mortality")` | `unicefdata, search(mortality)` |
| Fetch | `unicefData(indicator="CME_MRY0T4")` | `unicefData(indicator="CME_MRY0T4")` | `unicefdata, indicator(CME_MRY0T4) clear` |
| Latest | `unicefData(..., latest=TRUE)` | `unicefData(..., latest=True)` | `unicefdata, ... latest clear` |
| Wide | `unicefData(..., format="wide")` | `unicefData(..., format="wide")` | `unicefdata, ... wide clear` |
| Cache | `clear_unicef_cache()` | `clear_cache()` | `unicefdata, clearcache` |
| Sync | `sync_metadata()` | `sync_metadata()` | `unicefdata_sync, all` |

This parity enables cross-team collaboration: an analyst can prototype in R and a colleague can reproduce the workflow in Stata or Python with minimal translation.

## Design Principles and Reproducibility

The unicefData package embodies three design principles that make reproducibility the default rather than the exception:

### 1. Data acquisition as code

Consider the following call:

```r
df <- unicefData(indicator = "CME_MRY0T4", countries = c("ALB", "USA", "BRA"), year = "2015:2023")
```

Here you are not performing manual steps that will be forgotten or left undocumented.
Every data selection decision (indicator, countries, years, disaggregations) is explicitly specified in your script. This ensures that:

- Others can **audit** your data selection
- You can **reproduce** the exact data request
- **Revisions** to upstream data are transparent
- Your analysis is **sustainable** across time

### 2. Backward compatibility as trust infrastructure

The package prioritizes stable syntax and predictable behavior. This matters because:

- Historical analyses remain reproducible even as the UNICEF SDMX API evolves
- Scripts written in 2026 are intended to keep working in 2030
- Backward compatibility reduces the risk that methodology becomes irreproducible due to infrastructure drift

### 3. Domain-specific syntax as error prevention

Rather than exposing HTTP requests and JSON parsing, the interface uses concepts familiar to development researchers: indicators, countries, years. This constrains input to meaningful values and reduces opportunities for error.

### Why This Matters for AI-Assisted Research

In an era where AI tools accelerate analytical workflows, these principles become more important, not less. As generative tools lower the cost of producing plausible analyses and narratives, **anchoring empirical work to authoritative and verifiable data sources** is essential infrastructure for scientific credibility. The unicefData package provides this foundation by making data provenance explicit and executable.

## Further reading

- [UNICEF Data Warehouse](https://sdmx.data.unicef.org/)
- [SDMX standard documentation](https://sdmx.org/)
- [Package source on GitHub](https://github.com/unicef-drp/unicefData)
- **Azevedo, J.P. (2026).** "Data Provenance in the Age of Automation: Lessons from Fifteen Years of Programmatic Access to World Bank Open Data." *Stata Journal* (forthcoming). Foundational paper on design principles for data acquisition tools.

---

## Acknowledgments

This package was developed at the UNICEF Data and Analytics Section.
The author gratefully acknowledges the collaboration of **Lucas Rodrigues**, **Yang Liu**, and **Karen Avanesian**, whose technical contributions and feedback were instrumental in the development of this R package.

Special thanks to **Yves Jaques**, **Alberto Sibileau**, and **Daniele Olivotti** for designing and maintaining the UNICEF SDMX data warehouse infrastructure that makes this package possible.

The author also acknowledges the **UNICEF database managers** and technical teams who ensure data quality, as well as the country office staff and National Statistical Offices whose data collection efforts make this work possible.

Development of this package was supported by UNICEF institutional funding for data infrastructure and statistical capacity building. The author also acknowledges UNICEF colleagues who provided testing and feedback during development, as well as the broader open-source R community.

Development was assisted by AI coding tools (GitHub Copilot, Claude). All code has been reviewed, tested, and validated by the package maintainers.

## Disclaimer

**This package is provided for research and analytical purposes.**

The `unicefData` package provides programmatic access to UNICEF's public data warehouse. While the author is affiliated with UNICEF, **this package is not an official UNICEF product and the statements in this documentation are the views of the author and do not necessarily reflect the policies or views of UNICEF**.

Data accessed through this package come from the [UNICEF Data Warehouse](https://sdmx.data.unicef.org/). Users should verify critical data points against official UNICEF publications at [data.unicef.org](https://data.unicef.org/).

This software is provided "as is", without warranty of any kind, express or implied, including but not limited to the warranties of merchantability, fitness for a particular purpose and noninfringement.
In no event shall the authors or UNICEF be liable for any claim, damages or other liability arising from the use of this software.

The designations employed and the presentation of material in this package do not imply the expression of any opinion whatsoever on the part of UNICEF concerning the legal status of any country, territory, city or area or of its authorities, or concerning the delimitation of its frontiers or boundaries.

## Data Citation and Provenance

**Important Note on Data Vintages**

Official statistics are subject to revisions as new information becomes available and estimation methodologies improve. UNICEF indicators are regularly updated based on new surveys, censuses, and improved modeling techniques. Historical values may be revised retroactively to reflect better information or methodological improvements.

**For reproducible research and proper data attribution, users should:**

1. **Document the indicator code** - Specify the exact SDMX indicator code(s) used (e.g., `CME_MRY0T4`)
2. **Record the download date** - Note when the data were accessed (e.g., "Data downloaded: 2026-02-09")
3. **Cite the data source** - Reference both the package and the UNICEF Data Warehouse
4. **Archive your dataset** - Save a copy of the exact data used in your analysis

**Example citation for data used in research:**

> Under-5 mortality data (indicator: CME_MRY0T4) accessed from UNICEF Data Warehouse via unicefData R package (v2.1.0) on 2026-02-09. Data available at: https://sdmx.data.unicef.org/

This practice ensures that others can verify your results and understand any differences that may arise from data updates. For official UNICEF statistics in publications, always cross-reference with the current version at [data.unicef.org](https://data.unicef.org/).

## Citation

If you use this package in your research, please cite:

```
Azevedo, J.P. (2026). unicefData: Trilingual R, Python, and Stata Interface to UNICEF SDMX Data Warehouse.
R package version 2.1.0.
https://github.com/unicef-drp/unicefData
```

For data citations, please refer to the specific UNICEF datasets accessed through the warehouse and cite them according to UNICEF's data citation guidelines.

## License

This package is released under the MIT License. See the LICENSE file for full details.

## Contact & Support

- **Package Maintainer**: Joao Pedro Azevedo (jpazevedo@unicef.org)
- **Report Issues**: https://github.com/unicef-drp/unicefData/issues
- **UNICEF Data Portal**: https://data.unicef.org/
- **SDMX API Documentation**: https://data.unicef.org/sdmx-api-documentation/