---
title: "Introduction to healthbR"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Introduction to healthbR}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  eval = FALSE
)
```

## Overview

The healthbR package provides easy access to Brazilian public health survey data directly from R. It downloads, caches, and processes data from official Ministry of Health sources, returning clean, analysis-ready tibbles that follow tidyverse conventions.

Currently, healthbR supports **VIGITEL** (Vigilância de Fatores de Risco e Proteção para Doenças Crônicas por Inquérito Telefônico), a telephone survey that monitors risk and protective factors for chronic diseases in the 26 Brazilian state capitals and the Federal District.

## Getting started

```{r setup}
library(healthbR)
library(dplyr)
```

### Check available data

Before downloading data, you can check which years are available:

```{r}
vigitel_years()
```

### Download and load data

The main function for accessing VIGITEL data is `vigitel_data()`:

```{r}
# load a single year
df <- vigitel_data(2023)

# load multiple years
df <- vigitel_data(2021:2023)
```

Data are cached locally after the first download, so subsequent calls for the same year read from the cache instead of downloading again.
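The download-once, reuse-later idea can be sketched in a few lines of base R. This is an illustration only, not healthbR's actual implementation: `cached_load()`, its `loader` argument, and the file naming are hypothetical.

```{r}
# hypothetical helper illustrating download-once, reuse-later caching
cached_load <- function(year, loader, cache_dir) {
  path <- file.path(cache_dir, paste0("vigitel_", year, ".rds"))
  if (file.exists(path)) {
    return(readRDS(path))  # cache hit: skip the expensive step
  }
  data <- loader(year)     # cache miss: run the expensive step once
  saveRDS(data, path)
  data
}

# toy loader that counts how often it is actually called
calls <- 0
toy_loader <- function(year) {
  calls <<- calls + 1
  data.frame(ano = year, pesorake = 1)
}

cache_dir <- tempfile("healthbr-cache-demo-")
dir.create(cache_dir)

first  <- cached_load(2023, toy_loader, cache_dir)  # runs the loader
second <- cached_load(2023, toy_loader, cache_dir)  # reads the cached file
```

After both calls, `calls` is still 1, because the second call never reaches the loader.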

## Understanding the data

### Variable dictionary

VIGITEL uses coded variable names (`q6`, `q8`, etc.). Use the dictionary to see what each variable represents:

```{r}
dict <- vigitel_dictionary()
dict
```


You can search for specific variables:

```{r}
# find weight-related variables
dict |>
  filter(stringr::str_detect(variable_name, "peso"))

# find diabetes-related variables
dict |>
  filter(stringr::str_detect(variable_name, "diab"))
```
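If the dictionary also carries descriptive text, a case-insensitive search over several columns can be more forgiving than matching on the coded name alone. The sketch below uses a toy dictionary: the `description` column and the `search_dict()` helper are hypothetical, so check `names(dict)` for the columns the real dictionary provides.

```{r}
# toy stand-in for the dictionary; `description` is a hypothetical column
toy_dict <- data.frame(
  variable_name = c("q6", "q8_anos", "pesorake", "diab"),
  description   = c("Sex", "Age in years",
                    "Post-stratification weight", "Diabetes diagnosis"),
  stringsAsFactors = FALSE
)

# hypothetical helper: case-insensitive match in either column
search_dict <- function(dict, pattern) {
  hit <- grepl(pattern, dict$variable_name, ignore.case = TRUE) |
    grepl(pattern, dict$description, ignore.case = TRUE)
  dict[hit, , drop = FALSE]
}

search_dict(toy_dict, "weight")
search_dict(toy_dict, "DIAB")
```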

### List variables for a specific year

Variables may change between survey years. Check which variables are available:
 
```{r}
vigitel_variables(2023)
```
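When pooling years, it helps to know which variables the years share. A minimal sketch with base R set operations on two toy data frames; the column sets here are invented for illustration and are not the actual VIGITEL layouts.

```{r}
# toy stand-ins for two survey years
df_a <- data.frame(cidade = 1, q6 = 1, pesorake = 1.2, diab = 0)
df_b <- data.frame(cidade = 1, q6 = 2, pesorake = 0.8, hart = 1)

common <- intersect(names(df_a), names(df_b))  # present in both years
only_a <- setdiff(names(df_a), names(df_b))    # missing from the second year
only_b <- setdiff(names(df_b), names(df_a))    # missing from the first year

# stack the years using only the shared columns
pooled <- rbind(df_a[common], df_b[common])
```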

## Survey analysis

VIGITEL uses a complex sampling design with post-stratification weights. For valid population estimates, always weight analyses by the `pesorake` variable.

### Using srvyr for weighted analysis

```{r}
library(srvyr)

# create survey design object
vigitel_svy <- df |>
  as_survey_design(weights = pesorake)

# calculate weighted prevalence of diabetes by city
vigitel_svy |>
  group_by(cidade) |>
  summarize(
    prevalence = survey_mean(diab == 1, na.rm = TRUE),
    n = unweighted(n())
  )
```
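To see what the weights do, note that the point estimate of a weighted prevalence is the sum of the weights of respondents with the condition divided by the sum of all weights. The base R sketch below uses invented toy values and reproduces only the point estimate, not the standard errors that `survey_mean()` also reports.

```{r}
# invented toy values, not VIGITEL data
toy <- data.frame(
  diab     = c(1, 0, 0, 1, 0),
  pesorake = c(0.5, 2.0, 1.5, 0.5, 0.5)
)

has_diab <- as.numeric(toy$diab == 1)

# every respondent counts equally
unweighted_prev <- mean(has_diab)

# each respondent counts in proportion to their weight
weighted_prev <- sum(toy$pesorake * has_diab) / sum(toy$pesorake)

# the same point estimate via stats::weighted.mean()
weighted_prev_check <- weighted.mean(has_diab, toy$pesorake)
```

In this toy example the two respondents with diabetes carry small weights, so the weighted prevalence is lower than the unweighted one.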

### Key variables

Some commonly used variables in VIGITEL:
 
| Variable | Description |
|----------|-------------|
| `cidade` | City code (1-27: the 26 state capitals and the Federal District) |
| `q6` | Sex |
| `q8_anos` | Age in years |
| `pesorake` | Post-stratification weight |
| `diab` | Diabetes diagnosis |
| `hart` | Hypertension diagnosis |
| `fumante` | Current smoker |
| `imc` | Body Mass Index |
| `obesid` | Obesity indicator |

Consult `vigitel_dictionary()` for the complete list.
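Continuous variables such as `imc` are often grouped before analysis. A minimal sketch using the standard WHO adult BMI cutoffs on invented values; the category labels are illustrative, and the `obesid` indicator shipped with the data may be defined differently, so check the dictionary.

```{r}
# invented BMI values, including a missing one
imc <- c(17.0, 22.4, 27.5, 30.0, NA)

imc_cat <- cut(
  imc,
  breaks = c(-Inf, 18.5, 25, 30, Inf),
  labels = c("underweight", "normal", "overweight", "obese"),
  right  = FALSE  # intervals are [lower, upper), so 30.0 counts as obese
)

table(imc_cat, useNA = "ifany")
```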

## Performance optimization

healthbR offers three strategies for working with large datasets efficiently.

### 1. Parquet conversion

Convert the downloaded Excel files to Parquet for much faster loading (roughly 10-20x):

```{r}
# one-time conversion
vigitel_convert_to_parquet(2015:2023)

# subsequent loads use parquet automatically
df <- vigitel_data(2015:2023)
```
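For readers unfamiliar with Parquet, the sketch below shows a round trip with the `arrow` package on a toy data frame. It illustrates the file format only: how `vigitel_convert_to_parquet()` stores its files is not shown here, and the path is a temporary placeholder.

```{r}
# runs only if the arrow package is installed
if (requireNamespace("arrow", quietly = TRUE)) {
  toy <- data.frame(cidade = c(1L, 2L), pesorake = c(1.2, 0.8))

  path <- tempfile(fileext = ".parquet")
  arrow::write_parquet(toy, path)    # write a columnar file
  back <- arrow::read_parquet(path)  # read it back as a data frame

  # a columnar format lets you read only the columns you need
  weights_only <- arrow::read_parquet(path, col_select = "pesorake")
}
```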

### 2. Parallel downloads

When downloading multiple years, healthbR automatically uses parallel processing if the `furrr` package is available:

```{r}
# downloads happen in parallel (2-4 workers)
df <- vigitel_data(2015:2023)
```
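The "parallel if available" pattern can be sketched as follows. This is not healthbR's internal code: `fetch_year()` and `map_years()` are hypothetical stand-ins, and the worker count of 2 is an arbitrary choice.

```{r}
# hypothetical stand-in for a per-year download
fetch_year <- function(year) data.frame(ano = year, n = 1L)

# hypothetical helper: parallel when furrr is installed, sequential otherwise
map_years <- function(years, f) {
  has_furrr <- requireNamespace("furrr", quietly = TRUE) &&
    requireNamespace("future", quietly = TRUE)
  if (!has_furrr) {
    return(lapply(years, f))  # sequential fallback
  }
  old_plan <- future::plan(future::multisession, workers = 2)
  on.exit(future::plan(old_plan), add = TRUE)  # restore the previous plan
  furrr::future_map(years, f)
}

res <- do.call(rbind, map_years(2021:2023, fetch_year))
```

Either branch returns one data frame per year, so the calling code does not need to know which one ran.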

### 3. Lazy evaluation with Arrow

For very large datasets, use lazy evaluation to filter and select data before loading into memory:

```{r}
# returns Arrow Dataset (not loaded into RAM)
df_lazy <- vigitel_data(2015:2023, lazy = TRUE)

# operations are executed lazily
result <- df_lazy |>
  filter(cidade == 1, q8_anos >= 18) |>
  select(q6, q8_anos, pesorake, diab, hart, imc) |>
  collect()  # only now is the data loaded into memory
```

This approach is especially useful when you only need a subset of the data.
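To make the lazy behaviour concrete, the sketch below writes a toy dataset to a temporary directory and queries it with `arrow::open_dataset()`. The data are invented, and whether `vigitel_data(lazy = TRUE)` is built the same way is an assumption; the point is only that no rows are read into memory until `collect()`.

```{r}
library(dplyr)

# runs only if the arrow package is installed
if (requireNamespace("arrow", quietly = TRUE)) {
  toy <- data.frame(
    cidade   = c(1L, 1L, 2L),
    q8_anos  = c(17L, 45L, 60L),
    pesorake = c(0.9, 1.1, 1.0)
  )
  demo_dir <- tempfile("vigitel-demo-")
  arrow::write_dataset(toy, demo_dir, format = "parquet")

  ds <- arrow::open_dataset(demo_dir)  # a Dataset object, no rows read yet

  query <- ds |>
    filter(cidade == 1, q8_anos >= 18) |>
    select(q8_anos, pesorake)          # still lazy: only a query plan

  result <- collect(query)             # rows are read and filtered here
}
```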

## Workflow example

Here is a complete workflow that estimates diabetes prevalence by sex:

```{r}
library(healthbR)
library(dplyr)
library(srvyr)

# 1. load data
df <- vigitel_data(2023)

# 2. create survey design
svy <- df |>
  as_survey_design(weights = pesorake)

# 3. calculate prevalence by sex
diabetes_by_sex <- svy |>
  group_by(q6) |>
  summarize(
    prevalence = survey_mean(diab == 1, na.rm = TRUE, vartype = "ci"),
    n = unweighted(n())
  )

diabetes_by_sex
```
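With `vartype = "ci"`, srvyr names the interval columns by appending `_low` and `_upp` to the estimate name. A small sketch for turning such a result into a readable percentage column; the numbers below are invented placeholders, not VIGITEL estimates.

```{r}
# invented values shaped like a srvyr result with vartype = "ci"
toy_result <- data.frame(
  q6             = c(1, 2),
  prevalence     = c(0.091, 0.112),
  prevalence_low = c(0.080, 0.101),
  prevalence_upp = c(0.102, 0.123)
)

# format as "estimate% (lower-upper)" with one decimal place
toy_result$label <- sprintf(
  "%.1f%% (%.1f-%.1f)",
  100 * toy_result$prevalence,
  100 * toy_result$prevalence_low,
  100 * toy_result$prevalence_upp
)

toy_result[, c("q6", "label")]
```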

## Additional resources

- [VIGITEL official page](https://www.gov.br/saude/pt-br/composicao/svsa/inqueritos-de-saude/vigitel)
- [VIGITEL methodology](https://svs.aids.gov.br/download/Vigitel/)
- [srvyr package documentation](https://cran.r-project.org/package=srvyr)