---
title: "Download and Prepare PNADC Data"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Download and Prepare PNADC Data}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(
  eval = FALSE,
  echo = TRUE,
  message = FALSE,
  warning = FALSE,
  purl = FALSE
)
```

## Introduction

This vignette provides a complete workflow for downloading Brazil's quarterly PNADC (Pesquisa Nacional por Amostra de Domicílios Contínua) microdata and preparing it for mensalization. The workflow covers three steps:

1. **Downloading** quarterly PNADC microdata from IBGE using the `PNADcIBGE` package
2. **Stacking** multiple quarters into a single dataset (critical for high determination rates)
3. **Applying mensalization** using the `PNADCperiods` package

If you already have PNADC data and want to learn the package API first, see [Get Started](getting-started.html). For algorithm details, see [How PNADCperiods Works](how-it-works.html).

---

## Prerequisites

### Required Packages

```{r packages}
# Install packages if needed
install.packages(c("PNADcIBGE", "data.table", "fst"))

# Install PNADCperiods from GitHub
# remotes::install_github("antrologos/PNADCperiods")

# Load packages
library(PNADcIBGE)
library(data.table)
library(fst)
library(PNADCperiods)
```

### System Requirements

- **Disk space**: ~5 GB for 2020-2024 data, ~15 GB for the full history (2012-present)
- **RAM**: at least 8 GB recommended; 16 GB for comfortable processing
- **Time**: 2-3 hours for downloading (depending on internet speed), ~5 minutes for processing
- **Internet**: required for downloading data and for SIDRA API access (weight calibration)

---

## Understanding PNADC Data

PNADC is Brazil's primary household survey for labor market statistics, conducted by IBGE. The survey uses a rotating panel design in which each household is interviewed five times over 15 months. Each quarterly release contains approximately 500,000 observations.
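To make the panel structure concrete, here is a toy illustration (the identifier values are made up) of how a single household — keyed by `UPA`, `V1008`, and `V1014`, the identifier columns used later in this vignette — recurs across quarterly releases:

```r
library(data.table)

# Toy example (made-up values): one household observed in five
# consecutive quarterly releases, as in the rotating panel design
panel <- data.table(
  Ano       = c(2020, 2020, 2020, 2020, 2021),
  Trimestre = c(1, 2, 3, 4, 1),
  UPA       = 110000016,  # primary sampling unit
  V1008     = 1,          # household number within the UPA
  V1014     = 5           # panel rotation group
)

# Count quarterly interviews per household key
interviews <- panel[, .(n_interviews = .N), by = .(UPA, V1008, V1014)]
interviews$n_interviews  # 5 interviews over 15 months
```

It is this recurrence across releases that the mensalization algorithm exploits: the more quarters you stack, the more interviews per household it can compare.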
**Why stack multiple quarters?** The mensalization algorithm identifies reference months by tracking households across their panel interviews. With a single quarter, the determination rate is only ~70%. By stacking multiple quarters, the algorithm leverages the rotating panel structure to achieve **over 97% determination**.

| Quarters Stacked | Month % | Fortnight % | Week % |
|------------------|---------|-------------|--------|
| 1 (single quarter) | ~70% | ~7% | ~2% |
| 8 (2 years) | ~94% | ~9% | ~3% |
| 20 (5 years) | ~95% | ~8% | ~3% |
| 55+ (full history) | **~97%** | **~9%** | **~3%** |

For most applications, we recommend stacking at least 2 years (8 quarters) of data.

---

## Step 1: Set Up Your Environment

```{r setup-dir}
# Set your data directory (adjust the path as needed)
data_dir <- "path/to/your/pnadc_data/"
dir.create(data_dir, recursive = TRUE, showWarnings = FALSE)
```

## Step 2: Define Which Quarters to Download

Create a grid of year-quarter combinations. This example uses 2020-2024, which provides a good balance between data size and determination rate:

```{r editions}
# Define quarters to download (2020-2024 example)
editions <- expand.grid(
  year    = 2020:2024,
  quarter = 1:4
)

# If downloading recent years, filter out quarters not yet available:
# editions <- editions[!(editions$year == 2025 & editions$quarter > 3), ]
```

## Step 3: Download the Data

The download loop fetches each quarter from IBGE and saves it in FST format for fast loading:

```{r download-loop}
for (i in seq_len(nrow(editions))) {

  year_i    <- editions$year[i]
  quarter_i <- editions$quarter[i]
  filename  <- paste0("pnadc_", year_i, "-", quarter_i, "q.fst")

  cat("Downloading:", year_i, "Q", quarter_i, "\n")

  # Download from IBGE
  pnadc_quarter <- get_pnadc(
    year     = year_i,
    quarter  = quarter_i,
    labels   = FALSE,  # IMPORTANT: use numeric codes, not labels
    deflator = FALSE,
    design   = FALSE,
    savedir  = data_dir
  )

  # Save in FST format (fast serialization)
  write_fst(pnadc_quarter, file.path(data_dir, filename))

  # Clean up temporary files created by PNADcIBGE
  temp_files <- list.files(data_dir, pattern = "\\.(zip|sas|txt)$",
                           full.names = TRUE)
  file.remove(temp_files)

  rm(pnadc_quarter)
  gc()
}
```

> **Important**: Always use `labels = FALSE` when downloading. The mensalization algorithm requires numeric codes for the birthday variables (V2008, V20081, V20082). Using labeled factors will cause errors.

---

## Step 4: Stack the Quarterly Files

Stack all quarterly files into a single dataset. To save memory, load only the columns needed for mensalization, plus any variables you will analyze later (such as the labor force status variables used in Step 7):

```{r stack-data}
# Columns needed for mensalization and for the analysis examples
cols_needed <- c(
  # Time and identifiers
  "Ano", "Trimestre", "UPA", "V1008", "V1014",

  # Birthday variables (for the reference period algorithm)
  "V2008", "V20081", "V20082", "V2009",

  # Weight and stratification (for weight calibration)
  "V1028", "UF", "posest", "posest_sxi",

  # Labor force status (for the monthly estimates in Step 7)
  "VD4001", "VD4002"
)

# Stack all quarters
files <- list.files(data_dir, pattern = "pnadc_.*\\.fst$", full.names = TRUE)

pnadc_stacked <- rbindlist(lapply(files, function(f) {
  cat("Loading:", basename(f), "\n")
  read_fst(f, columns = cols_needed, as.data.table = TRUE)
}))

cat("Total observations:", format(nrow(pnadc_stacked), big.mark = ","), "\n")
```

---

## Step 5: Apply Mensalization

Build the crosswalk (identify reference periods) and calibrate weights:

```{r mensalize}
# Step 1: Build the crosswalk (identify reference periods)
crosswalk <- pnadc_identify_periods(pnadc_stacked, verbose = TRUE)

# Check determination rates
crosswalk[, .(
  month_rate     = mean(determined_month),
  fortnight_rate = mean(determined_fortnight),
  week_rate      = mean(determined_week)
)]

# Step 2: Apply the crosswalk and calibrate weights
result <- pnadc_apply_periods(
  data             = pnadc_stacked,
  crosswalk        = crosswalk,
  weight_var       = "V1028",
  anchor           = "quarter",
  calibrate        = TRUE,
  calibration_unit = "month",
  verbose          = TRUE
)
```

The verbose output shows progress and determination rates for each phase (month, fortnight, week).
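Conceptually, applying the crosswalk amounts to a keyed merge of reference-period indicators onto the microdata (plus weight calibration, which the toy below omits). This sketch is an illustration only, not the package's actual implementation — the join keys and all values are made up:

```r
library(data.table)

# Three observations in the microdata, two household keys
obs <- data.table(UPA = c(1, 1, 2), V1008 = 1, V1014 = 1,
                  Ano = 2020, Trimestre = 1)

# Toy crosswalk: one determined household, one indeterminate
xwalk <- data.table(UPA = c(1, 2), V1008 = 1, V1014 = 1,
                    Ano = 2020, Trimestre = 1,
                    ref_month_in_quarter = c(2, NA),
                    determined_month = c(TRUE, FALSE))

# Keyed merge: attach reference-period indicators to every observation
merged <- xwalk[obs, on = .(UPA, V1008, V1014, Ano, Trimestre)]

mean(merged$determined_month)  # share with a determined month (2 of 3 here)
```

The determination rates reported by the verbose output are exactly this kind of share, computed over the full stacked dataset.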
With 20 quarters stacked (2020-2024), expect ~95% month determination.

---

## Step 6: Explore the Results

The result contains all original columns plus reference period indicators and calibrated weights:

```{r explore}
# Key new columns
names(result)[grep("ref_|determined_|weight_", names(result))]

# Distribution of reference months within quarters
result[, .N, by = ref_month_in_quarter][order(ref_month_in_quarter)]
```

Key output columns:

| Column | Description |
|--------|-------------|
| `ref_month_in_quarter` | Position within the quarter (1, 2, or 3; NA if indeterminate) |
| `ref_month_yyyymm` | Reference month as a YYYYMM integer (e.g., 202301) |
| `determined_month` | Logical flag (TRUE if the month was determined) |
| `weight_monthly` | Calibrated monthly weight (if `calibrate = TRUE`) |

The distribution is approximately equal across months 1, 2, and 3 (each around 31-32%), with the remaining observations having `NA` for indeterminate cases.

---

## Step 7: Save and Use the Results

Save the mensalized data for future use:

```{r save}
write_fst(result, file.path(data_dir, "pnadc_mensalized.fst"))
```

To compute monthly estimates, filter to determined observations and aggregate by `ref_month_yyyymm`:

```{r monthly-analysis}
# Monthly unemployment rate
monthly_unemployment <- result[determined_month == TRUE, .(
  unemployment_rate = sum((VD4002 == 2) * weight_monthly, na.rm = TRUE) /
                      sum((VD4001 == 1) * weight_monthly, na.rm = TRUE)
), by = ref_month_yyyymm]

# Monthly population
monthly_pop <- result[determined_month == TRUE, .(
  population = sum(weight_monthly, na.rm = TRUE)
), by = ref_month_yyyymm]
```

For more analysis examples, see [Applied Examples](applied-examples.html).

---

## Memory and Performance Tips

1. **Selective column loading**: Load only the columns you need with `read_fst(..., columns = ...)`. This dramatically reduces memory usage.
2. **Process in batches**: For very large analyses, process one year at a time and combine the results.
3. **Use FST format**: FST is much faster than CSV or RDS for large datasets. A typical quarter loads in seconds rather than minutes.
4. **Clean up regularly**: Use `rm()` and `gc()` to free memory after processing each quarter.

### File Size Reference

| Period | Quarters | Observations | FST Size |
|--------|----------|--------------|----------|
| 2020-2024 | 20 | ~8.9M | ~5 GB |
| 2012-2025 | 55 | ~29M | ~15 GB |

---

## Extending to Full History

For the best determination rate and for longitudinal analysis, download all available quarters:

```{r full-history}
# Download all available data (2012-present)
editions_full <- expand.grid(
  year    = 2012:2025,
  quarter = 1:4
)
editions_full <- editions_full[!(editions_full$year == 2025 &
                                 editions_full$quarter > 3), ]

# Use the same download and stacking workflow as above
```

The full history provides approximately 29 million observations and achieves the highest possible determination rate (~97% month determination).

---

## Troubleshooting

1. **"Column not found" errors**: Ensure you used `labels = FALSE` when downloading. The algorithm requires numeric codes.
2. **Download failures**: IBGE servers can be slow or unavailable. The `PNADcIBGE` package will retry automatically, but you may need to restart interrupted downloads.
3. **Memory errors**: Try processing fewer quarters at a time, or use a machine with more RAM.
4. **SIDRA API errors**: Weight calibration requires internet access to the SIDRA API. If it fails, try again later, or use `calibrate = FALSE` to identify reference periods without weight calibration.

---

## Next Steps

- Follow the usage patterns in [Get Started](getting-started.html) with your real data
- See analysis examples in [Applied Examples](applied-examples.html)
- Learn about the algorithm in [How PNADCperiods Works](how-it-works.html)

> **Working with annual PNADC data?** Annual data (visit-specific microdata with comprehensive income variables) requires a different workflow. See [Monthly Poverty Analysis with Annual PNADC Data](annual-poverty-analysis.html) for details on using `pnadc_apply_periods()` with `anchor = "year"`.

---

## References

- Hecksher, Marcos. "Valor Impreciso por Mês Exato: Microdados e Indicadores Mensais Baseados na Pnad Contínua". Nota Técnica Disoc, n. 62. Brasília, DF: IPEA, 2020.
- Hecksher, Marcos. "Cinco meses de perdas de empregos e simulação de um incentivo a contratações". Nota Técnica Disoc, n. 87. Brasília, DF: IPEA, 2020.
- Hecksher, Marcos. "Mercado de trabalho: A queda da segunda quinzena de março, aprofundada em abril". Carta de Conjuntura, v. 47, p. 1-6. Brasília, DF: IPEA, 2020.
- Barbosa, Rogério J.; Hecksher, Marcos. (2026). PNADCperiods: Identify Reference Periods in Brazil's PNADC Survey Data. R package version 0.1.0.