---
title: "Download and Prepare PNADC Data"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Download and Prepare PNADC Data}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(
  eval = FALSE,
  echo = TRUE,
  message = FALSE,
  warning = FALSE,
  purl = FALSE
)
```

## Introduction

This vignette provides a complete workflow for downloading Brazil's quarterly PNADC (Pesquisa Nacional por Amostra de Domicílios Contínua) microdata and preparing it for mensalization. The workflow covers three steps:

1. **Downloading** quarterly PNADC microdata from IBGE using the `PNADcIBGE` package
2. **Stacking** multiple quarters into a single dataset (critical for high determination rates)
3. **Applying mensalization** using the `PNADCperiods` package

If you already have PNADC data and want to learn the package API first, see [Get Started](getting-started.html). For algorithm details, see [How PNADCperiods Works](how-it-works.html).

---

## Prerequisites

### Required Packages

```{r packages}
# Install packages if needed
install.packages(c("PNADcIBGE", "data.table", "fst"))

# Install PNADCperiods from GitHub
# remotes::install_github("antrologos/PNADCperiods")

# Load packages
library(PNADcIBGE)
library(data.table)
library(fst)
library(PNADCperiods)
```

### System Requirements

- **Disk space**: ~5 GB for 2020-2024 data, ~15 GB for the full history (2012-present)
- **RAM**: at least 8 GB recommended; 16 GB for comfortable processing
- **Time**: 2-3 hours for downloading (depending on internet speed), ~5 minutes for processing
- **Internet**: required for downloading data and for SIDRA API access (weight calibration)

---

## Understanding PNADC Data

PNADC is Brazil's primary household survey for labor market statistics, conducted by IBGE. The survey uses a rotating panel design in which each household is interviewed five times over 15 months. Each quarterly release contains approximately 500,000 observations.
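To make the panel structure concrete, here is a toy illustration (the identifier values are made up) of how a single household — keyed by `UPA`, `V1008`, and `V1014`, the identifier columns used later in this vignette — recurs across quarterly releases:

```r
library(data.table)

# Toy example (made-up values): one household observed in five
# consecutive quarterly releases, as in the rotating panel design
panel <- data.table(
  Ano       = c(2020, 2020, 2020, 2020, 2021),
  Trimestre = c(1, 2, 3, 4, 1),
  UPA       = 110000016,  # primary sampling unit
  V1008     = 1,          # household number within the UPA
  V1014     = 5           # panel rotation group
)

# Count quarterly interviews per household key
interviews <- panel[, .(n_interviews = .N), by = .(UPA, V1008, V1014)]
interviews$n_interviews  # 5 interviews over 15 months
```

It is this recurrence across releases that the mensalization algorithm exploits: the more quarters you stack, the more interviews per household it can compare.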
**Why stack multiple quarters?** The mensalization algorithm identifies reference months by tracking households across their panel interviews. With a single quarter, the determination rate is only ~70%. By stacking multiple quarters, the algorithm leverages the rotating panel structure to achieve **over 97% determination**.

| Quarters Stacked | Month % | Fortnight % | Week % |
|------------------|---------|-------------|--------|
| 1 (single quarter) | ~70% | ~7% | ~2% |
| 8 (2 years) | ~94% | ~9% | ~3% |
| 20 (5 years) | ~95% | ~8% | ~3% |
| 55+ (full history) | **~97%** | **~9%** | **~3%** |

For most applications, we recommend stacking at least 2 years (8 quarters) of data.

---

## Step 1: Set Up Your Environment

```{r setup-dir}
# Set your data directory (adjust the path as needed)
data_dir <- "path/to/your/pnadc_data/"
dir.create(data_dir, recursive = TRUE, showWarnings = FALSE)
```

## Step 2: Define Which Quarters to Download

Create a grid of year-quarter combinations. This example uses 2020-2024, which provides a good balance between data size and determination rate:

```{r editions}
# Define quarters to download (2020-2024 example)
editions <- expand.grid(
  year    = 2020:2024,
  quarter = 1:4
)

# If downloading recent years, filter out quarters not yet available:
# editions <- editions[!(editions$year == 2025 & editions$quarter > 3), ]
```

## Step 3: Download the Data

The download loop fetches each quarter from IBGE and saves it in FST format for fast loading:

```{r download-loop}
for (i in seq_len(nrow(editions))) {

  year_i    <- editions$year[i]
  quarter_i <- editions$quarter[i]
  filename  <- paste0("pnadc_", year_i, "-", quarter_i, "q.fst")

  cat("Downloading:", year_i, "Q", quarter_i, "\n")

  # Download from IBGE
  pnadc_quarter <- get_pnadc(
    year     = year_i,
    quarter  = quarter_i,
    labels   = FALSE,  # IMPORTANT: use numeric codes, not labels
    deflator = FALSE,
    design   = FALSE,
    savedir  = data_dir
  )

  # Save in FST format (fast serialization)
  write_fst(pnadc_quarter, file.path(data_dir, filename))

  # Clean up temporary files created by PNADcIBGE
  temp_files <- list.files(data_dir, pattern = "\\.(zip|sas|txt)$",
                           full.names = TRUE)
  file.remove(temp_files)

  rm(pnadc_quarter)
  gc()
}
```

> **Important**: Always use `labels = FALSE` when downloading. The mensalization algorithm requires numeric codes for the birthday variables (V2008, V20081, V20082). Using labeled factors will cause errors.

---

## Step 4: Stack the Quarterly Files

Stack all quarterly files into a single dataset. To save memory, load only the columns needed for mensalization, plus any variables you will analyze later (such as the labor force status variables used in Step 7):

```{r stack-data}
# Columns needed for mensalization and for the analysis examples
cols_needed <- c(
  # Time and identifiers
  "Ano", "Trimestre", "UPA", "V1008", "V1014",

  # Birthday variables (for the reference period algorithm)
  "V2008", "V20081", "V20082", "V2009",

  # Weight and stratification (for weight calibration)
  "V1028", "UF", "posest", "posest_sxi",

  # Labor force status (for the monthly estimates in Step 7)
  "VD4001", "VD4002"
)

# Stack all quarters
files <- list.files(data_dir, pattern = "pnadc_.*\\.fst$", full.names = TRUE)

pnadc_stacked <- rbindlist(lapply(files, function(f) {
  cat("Loading:", basename(f), "\n")
  read_fst(f, columns = cols_needed, as.data.table = TRUE)
}))

cat("Total observations:", format(nrow(pnadc_stacked), big.mark = ","), "\n")
```

---

## Step 5: Apply Mensalization

Build the crosswalk (identify reference periods) and calibrate weights:

```{r mensalize}
# Step 1: Build the crosswalk (identify reference periods)
crosswalk <- pnadc_identify_periods(pnadc_stacked, verbose = TRUE)

# Check determination rates
crosswalk[, .(
  month_rate     = mean(determined_month),
  fortnight_rate = mean(determined_fortnight),
  week_rate      = mean(determined_week)
)]

# Step 2: Apply the crosswalk and calibrate weights
result <- pnadc_apply_periods(
  data             = pnadc_stacked,
  crosswalk        = crosswalk,
  weight_var       = "V1028",
  anchor           = "quarter",
  calibrate        = TRUE,
  calibration_unit = "month",
  verbose          = TRUE
)
```

The verbose output shows progress and determination rates for each phase (month, fortnight, week).
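Conceptually, applying the crosswalk amounts to a keyed merge of reference-period indicators onto the microdata (plus weight calibration, which the toy below omits). This sketch is an illustration only, not the package's actual implementation — the join keys and all values are made up:

```r
library(data.table)

# Three observations in the microdata, two household keys
obs <- data.table(UPA = c(1, 1, 2), V1008 = 1, V1014 = 1,
                  Ano = 2020, Trimestre = 1)

# Toy crosswalk: one determined household, one indeterminate
xwalk <- data.table(UPA = c(1, 2), V1008 = 1, V1014 = 1,
                    Ano = 2020, Trimestre = 1,
                    ref_month_in_quarter = c(2, NA),
                    determined_month = c(TRUE, FALSE))

# Keyed merge: attach reference-period indicators to every observation
merged <- xwalk[obs, on = .(UPA, V1008, V1014, Ano, Trimestre)]

mean(merged$determined_month)  # share with a determined month (2 of 3 here)
```

The determination rates reported by the verbose output are exactly this kind of share, computed over the full stacked dataset.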
With 20 quarters stacked (2020-2024), expect ~95% month determination.

---

## Step 6: Explore the Results

The result contains all original columns plus reference period indicators and calibrated weights:

```{r explore}
# Key new columns
names(result)[grep("ref_|determined_|weight_", names(result))]

# Distribution of reference months within quarters
result[, .N, by = ref_month_in_quarter][order(ref_month_in_quarter)]
```

Key output columns:

| Column | Description |
|--------|-------------|
| `ref_month_in_quarter` | Position within the quarter (1, 2, or 3; NA if indeterminate) |
| `ref_month_yyyymm` | Reference month as a YYYYMM integer (e.g., 202301) |
| `determined_month` | Logical flag (TRUE if the month was determined) |
| `weight_monthly` | Calibrated monthly weight (if `calibrate = TRUE`) |

The distribution is approximately equal across months 1, 2, and 3 (each around 31-32%), with the remaining observations having `NA` for indeterminate cases.

---

## Step 7: Save and Use the Results

Save the mensalized data for future use:

```{r save}
write_fst(result, file.path(data_dir, "pnadc_mensalized.fst"))
```

To compute monthly estimates, filter to determined observations and aggregate by `ref_month_yyyymm`:

```{r monthly-analysis}
# Monthly unemployment rate
monthly_unemployment <- result[determined_month == TRUE, .(
  unemployment_rate = sum((VD4002 == 2) * weight_monthly, na.rm = TRUE) /
                      sum((VD4001 == 1) * weight_monthly, na.rm = TRUE)
), by = ref_month_yyyymm]

# Monthly population
monthly_pop <- result[determined_month == TRUE, .(
  population = sum(weight_monthly, na.rm = TRUE)
), by = ref_month_yyyymm]
```

For more analysis examples, see [Applied Examples](applied-examples.html).

---

## Memory and Performance Tips

1. **Selective column loading**: Load only the columns you need with `read_fst(..., columns = ...)`. This dramatically reduces memory usage.
2. **Process in batches**: For very large analyses, process one year at a time and combine the results.
3. **Use FST format**: FST is much faster than CSV or RDS for large datasets. A typical quarter loads in seconds rather than minutes.
4. **Clean up regularly**: Use `rm()` and `gc()` to free memory after processing each quarter.

### File Size Reference

| Period | Quarters | Observations | FST Size |
|--------|----------|--------------|----------|
| 2020-2024 | 20 | ~8.9M | ~5 GB |
| 2012-2025 | 55 | ~29M | ~15 GB |

---

## Extending to Full History

For the best determination rate and for longitudinal analysis, download all available quarters:

```{r full-history}
# Download all available data (2012-present)
editions_full <- expand.grid(
  year    = 2012:2025,
  quarter = 1:4
)
editions_full <- editions_full[!(editions_full$year == 2025 &
                                 editions_full$quarter > 3), ]

# Use the same download and stacking workflow as above
```

The full history provides approximately 29 million observations and achieves the highest possible determination rate (~97% month determination).

---

## Troubleshooting

1. **"Column not found" errors**: Ensure you used `labels = FALSE` when downloading. The algorithm requires numeric codes.
2. **Download failures**: IBGE servers can be slow or unavailable. The `PNADcIBGE` package will retry automatically, but you may need to restart interrupted downloads.
3. **Memory errors**: Try processing fewer quarters at a time, or use a machine with more RAM.
4. **SIDRA API errors**: Weight calibration requires internet access to the SIDRA API. If it fails, try again later, or use `calibrate = FALSE` to identify reference periods without weight calibration.

---

## Next Steps

- Follow the usage patterns in [Get Started](getting-started.html) with your real data
- See analysis examples in [Applied Examples](applied-examples.html)
- Learn about the algorithm in [How PNADCperiods Works](how-it-works.html)

> **Working with annual PNADC data?** Annual data (visit-specific microdata with comprehensive income variables) requires a different workflow. See [Monthly Poverty Analysis with Annual PNADC Data](annual-poverty-analysis.html) for details on using `pnadc_apply_periods()` with `anchor = "year"`.

---

## References

- Hecksher, Marcos. "Valor Impreciso por Mês Exato: Microdados e Indicadores Mensais Baseados na Pnad Contínua". Nota Técnica Disoc, n. 62. Brasília, DF: IPEA, 2020.
- Hecksher, Marcos. "Cinco meses de perdas de empregos e simulação de um incentivo a contratações". Nota Técnica Disoc, n. 87. Brasília, DF: IPEA, 2020.
- Hecksher, Marcos. "Mercado de trabalho: A queda da segunda quinzena de março, aprofundada em abril". Carta de Conjuntura, v. 47, p. 1-6. Brasília, DF: IPEA, 2020.
- Barbosa, Rogério J.; Hecksher, Marcos. (2026). PNADCperiods: Identify Reference Periods in Brazil's PNADC Survey Data. R package version 0.1.0.