--- title: "Introduction to the heaping Package" author: "Matthias Templ" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Introduction to the heaping Package} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 7, fig.height = 5 ) ``` ## Overview The **heaping** package provides tools for detecting and correcting heaping (digit preference) in survey data at the individual record level. Heaping occurs when respondents disproportionately report values at round numbers---for example, ages ending in 0 or 5, incomes rounded to the nearest thousand, or weights rounded to the nearest 5 pounds. The package provides: 1. **Heaping indices** for detecting and measuring heaping severity 2. **Correction methods** for removing heaping while preserving individual-level data 3. **Model-based correction** that preserves relationships with covariates ```{r load-package} library(heaping) ``` ## The Problem: What is Heaping? Heaping is a common data quality issue where certain values appear more frequently than expected due to rounding or digit preference. This is especially common with self-reported data. Let's create some example data to illustrate, using the same distribution parameters as in Templ (2024): ```{r create-example-data} # Generate realistic age data using a log-normal distribution # These parameters produce a realistic age distribution set.seed(123) age_true <- rlnorm(10000, meanlog = 2.466869, sdlog = 1.652772) age_true <- round(age_true[age_true < 93]) # Create heaped version where 27% of respondents round their age to multiples of 5 # This heap_ratio is typical for survey data with moderate heaping heap_ratio <- 0.27 n <- length(age_true) indices_to_heap <- sample(n, round(heap_ratio * n)) age_heaped <- age_true age_heaped[indices_to_heap] <- round(age_heaped[indices_to_heap] / 5) * 5 ``` Let's visualize the difference: ```{r visualize-heaping, fig.cap="Comparison of true ages vs heaped ages"} oldpar <- par(mfrow = c(1, 2)) # Define breaks to cover the full range age_breaks <- seq(0, max(age_true, age_heaped) + 1, by = 1) # True ages hist(age_true, breaks = age_breaks, col = "steelblue", main = "True Ages (No Heaping)", xlab = "Age", border = "white") # Heaped ages hist(age_heaped, breaks = age_breaks, col = "coral", main = "Heaped Ages", xlab = "Age", border = "white") par(oldpar) ``` Notice the spikes at ages ending in 0 and 5 in the heaped data. ## Detecting Heaping: Heaping Indices Before correcting heaping, we need to detect and quantify it. The package provides several established indices. ### Whipple's Index The most commonly used index. It compares the frequency of ages ending in 0 or 5 to the expected frequency. ```{r whipple-index} # Standard Whipple index (100 = no heaping, 500 = maximum heaping) whipple(age_true, method = "standard") whipple(age_heaped, method = "standard") # Modified Whipple index (0 = no heaping, 1 = maximum heaping) whipple(age_true, method = "modified") whipple(age_heaped, method = "modified") ``` ### Myers' Index Myers' blended index measures preferences for each terminal digit (0-9): ```{r myers-index} # Myers' index (0 = no heaping, 90 = maximum) myers(age_true) myers(age_heaped) ``` ### Bachi's Index Similar to Myers' but uses a different blending technique: ```{r bachi-index} # Bachi's index (0 = no heaping, 90 = maximum) bachi(age_true) bachi(age_heaped) ``` ### Noumbissi's Index Evaluates heaping for a specific terminal digit: ```{r noumbissi-index} # Noumbissi's index for digit 0 (1.0 = no heaping) noumbissi(age_true, digit = 0) noumbissi(age_heaped, digit = 0) # Noumbissi's index for digit 5 noumbissi(age_true, digit = 5) noumbissi(age_heaped, digit = 5) ``` ### Convenience Function: All Indices at Once ```{r all-indices} # Calculate all indices indices_true <- heaping_indices(age_true) indices_heaped <- heaping_indices(age_heaped) # Compare data.frame( Index = names(indices_true), True = round(unlist(indices_true), 3), Heaped = round(unlist(indices_heaped), 3) ) ``` ### Using Sampling Weights All indices support sampling weights for survey data: ```{r weighted-indices} # Create example weights weights <- runif(length(age_heaped), 0.5, 2) # Weighted Whipple index whipple(age_heaped, weight = weights) ``` ## Correcting Heaping The key innovation of this package is correcting heaping at the **individual level**. Traditional methods only smooth aggregated frequency distributions, which destroys individual records needed for regression analysis. ### Basic Correction The `correctHeaps()` function corrects heaping by replacing a proportion of heaped values with draws from truncated distributions: ```{r basic-correction} # Correct heaping using log-normal distribution age_corrected <- correctHeaps(age_heaped, heaps = "5year", # heaps at 0, 5, 10, 15, ... method = "lnorm", seed = 42) # for reproducibility # Compare Whipple indices c(Original = whipple(age_true), Heaped = whipple(age_heaped), Corrected = whipple(age_corrected)) ``` Let's visualize the correction: ```{r visualize-correction, fig.cap="Before and after heaping correction"} oldpar <- par(mfrow = c(1, 2)) hist(age_heaped, breaks = age_breaks, col = "coral", main = "Before Correction", xlab = "Age", border = "white") hist(age_corrected, breaks = age_breaks, col = "forestgreen", main = "After Correction", xlab = "Age", border = "white") par(oldpar) ``` ### Correction Methods Four distribution methods are available: ```{r correction-methods} # Log-normal (default) - good for right-skewed data like age age_lnorm <- correctHeaps(age_heaped, method = "lnorm", seed = 42) # Normal - good for symmetric data age_norm <- correctHeaps(age_heaped, method = "norm", seed = 42) # Uniform - simplest approach, baseline comparison age_unif <- correctHeaps(age_heaped, method = "unif", seed = 42) # Kernel - nonparametric, adapts to local data shape age_kernel <- correctHeaps(age_heaped, method = "kernel", seed = 42) # Compare results data.frame( Method = c("Original", "Heaped", "Log-normal", "Normal", "Uniform", "Kernel"), Whipple = c(whipple(age_true), whipple(age_heaped), whipple(age_lnorm), whipple(age_norm), whipple(age_unif), whipple(age_kernel)) ) ``` ### Verbose Output and Diagnostics Use `verbose = TRUE` to get detailed information about the correction: ```{r verbose-output} result <- correctHeaps(age_heaped, heaps = "5year", method = "lnorm", seed = 42, verbose = TRUE) # Number of values changed result$n_changed # Changes by heap age head(result$changes_by_heap, 10) # Heaping ratios (ratio > 1 indicates heaping) head(result$ratios, 10) ``` ### 10-Year Heaping For data with heaping only at ages ending in 0: ```{r 10year-heaping} # Create data with 10-year heaping age_heap10 <- age_true heapers10 <- sample(c(TRUE, FALSE), length(age_true), replace = TRUE, prob = c(0.3, 0.7)) age_heap10[heapers10] <- round(age_heap10[heapers10] / 10) * 10 # Correct 10-year heaping age_corrected10 <- correctHeaps(age_heap10, heaps = "10year", method = "lnorm", seed = 42) c(Heaped = whipple(age_heap10), Corrected = whipple(age_corrected10)) ``` ### Custom Heap Positions For non-standard heaping patterns, specify exact positions: ```{r custom-heaps} # Heaping at specific ages (e.g., 18, 21, 30, 40, 50, 65) custom_positions <- c(18, 21, 30, 40, 50, 65) age_custom <- correctHeaps(age_heaped, heaps = custom_positions, method = "lnorm", seed = 42) ``` ### Single Heap Correction To correct heaping at just one specific age: ```{r single-heap} # Add artificial heap at age 40 age_with_40heap <- c(age_true, rep(40, 500)) # Correct only the heap at 40 age_fixed_40 <- correctSingleHeap(age_with_40heap, heap = 40, before = 3, # range: 37-43 after = 3, method = "lnorm", seed = 42) # Check the counts at age 40 c(Before = sum(age_with_40heap == 40), After = sum(age_fixed_40 == 40)) ``` ## Model-Based Correction When age is correlated with other variables (income, education, etc.), simple correction can distort these relationships. Model-based correction uses covariates to make smarter corrections. ```{r model-based-setup, message=FALSE} # Create a dataset with correlated variables following the paper's approach set.seed(123) # Generate age from log-normal distribution age <- rlnorm(10000, meanlog = 2.466869, sdlog = 1.652772) age <- round(age[age < 93 & age >= 18]) n <- length(age) # Simulate covariates correlated with age age_scaled <- scale(age) # Income as a function of age with noise income <- exp(3 + 0.5 * age_scaled + rnorm(n, 0, 0.5)) # Marital status influenced by age marital_status <- ifelse(plogis(-3 + 0.05 * age_scaled + rnorm(n, 0, 0.5)) > 0.5, "married", "single") # Education level with age-dependent probabilities get_education_prob <- function(a) { if (a < 25) return(c(0.5, 0.35, 0.15)) else if (a < 40) return(c(0.3, 0.45, 0.25)) else return(c(0.2, 0.4, 0.4)) } education <- sapply(age, function(a) { sample(c("High School", "Bachelor", "Master"), 1, prob = get_education_prob(a)) }) data_example <- data.frame( age = age, income = income, marital_status = marital_status, education = education ) # Introduce heaping with 27% heap ratio (as in the paper) heap_ratio <- 0.27 indices_to_heap <- sample(n, round(heap_ratio * n)) data_example$age_heaped <- data_example$age data_example$age_heaped[indices_to_heap] <- round(data_example$age_heaped[indices_to_heap] / 5) * 5 ``` Apply model-based correction: ```{r model-based-correction, message=FALSE, warning=FALSE} # Model-based correction using income, education and marital status if (requireNamespace("ranger", quietly = TRUE) && requireNamespace("VIM", quietly = TRUE)) { # Also apply simple correction for comparison data_example$age_simple <- correctHeaps( data_example$age_heaped, heaps = "5year", method = "lnorm", seed = 42 ) # Model-based correction using covariates data_example$age_corrected <- correctHeaps( data_example$age_heaped, heaps = "5year", method = "lnorm", model = age_heaped ~ income + marital_status + education, dataModel = data_example, seed = 42 ) # Compare correlations with log(income) log_income <- log(data_example$income) cor_original <- cor(data_example$age, log_income) cor_heaped <- cor(data_example$age_heaped, log_income) cor_simple <- cor(data_example$age_simple, log_income) cor_corrected <- cor(data_example$age_corrected, log_income) cat("Correlation of age with log(income):\n") cat(" Original (true):", round(cor_original, 4), "\n") cat(" Heaped:", round(cor_heaped, 4), "\n") cat(" Simple correction:", round(cor_simple, 4), "\n") cat(" Model-based correction:", round(cor_corrected, 4), "\n") } ``` Let's visualize the model-based correction: ```{r model-based-visualization, fig.cap="Comparison of heaped ages and model-based corrected ages", fig.height=6} if (requireNamespace("ranger", quietly = TRUE) && requireNamespace("VIM", quietly = TRUE)) { oldpar <- par(mfrow = c(2, 2)) # Age distributions age_breaks <- seq(min(data_example$age) - 1, max(data_example$age) + 1, by = 1) hist(data_example$age_heaped, breaks = age_breaks, col = "coral", main = "Heaped Ages", xlab = "Age", border = "white") hist(data_example$age_corrected, breaks = age_breaks, col = "forestgreen", main = "Model-Based Corrected Ages", xlab = "Age", border = "white") # Age vs log(Income) relationships log_income <- log(data_example$income) plot(data_example$age_heaped, log_income, pch = 16, col = adjustcolor("coral", 0.3), cex = 0.5, main = "Heaped: Age vs log(Income)", xlab = "Age", ylab = "log(Income)") abline(lm(log_income ~ data_example$age_heaped), col = "darkred", lwd = 2) plot(data_example$age_corrected, log_income, pch = 16, col = adjustcolor("forestgreen", 0.3), cex = 0.5, main = "Corrected: Age vs log(Income)", xlab = "Age", ylab = "log(Income)") abline(lm(log_income ~ data_example$age_corrected), col = "darkgreen", lwd = 2) par(oldpar) } ``` The model-based correction preserves the relationship between age and income better than simple correction because it uses the covariate information to determine the direction of adjustments. ## Multiple Imputation Framework Because correction involves random sampling, you can create multiple corrected datasets for proper uncertainty quantification: ```{r multiple-imputation} # Create 5 corrected datasets with different seeds m <- 5 corrected_datasets <- lapply(1:m, function(i) { correctHeaps(age_heaped, heaps = "5year", method = "lnorm", seed = i * 100) }) # Calculate Whipple index for each whipple_values <- sapply(corrected_datasets, whipple) cat("Whipple indices across imputations:\n") cat(" Mean:", round(mean(whipple_values), 2), "\n") cat(" SD:", round(sd(whipple_values), 2), "\n") cat(" Range:", round(min(whipple_values), 2), "-", round(max(whipple_values), 2), "\n") ``` ## Protecting Specific Observations If some ages are known to be accurate (e.g., verified from administrative records), exclude them from correction: ```{r fixed-observations} # Assume first 100 observations are verified verified_indices <- 1:100 age_protected <- correctHeaps(age_heaped, heaps = "5year", method = "lnorm", fixed = verified_indices, # don't change these seed = 42) # Verify protected observations unchanged all(age_heaped[verified_indices] == age_protected[verified_indices]) ``` ## Working with Aggregated Data: Sprague Multipliers If you have 5-year age group data and need single-year ages, use the Sprague multipliers: ```{r sprague-example} # Example: population counts in 5-year groups pop_5year <- c( 1971990, 2095820, 2157190, 2094110, 2116580, # 0-4, 5-9, ..., 20-24 2003840, 1785690, 1502990, 1214170, 796934, # 25-29, ..., 45-49 627551, 530305, 488014, 364498, 259029, # 50-54, ..., 70-74 158047, 125941 # 75-79, 80+ ) # Disaggregate to single years pop_single <- sprague(pop_5year) # First 15 ages head(pop_single, 15) # Total is preserved c(Sum_5year = sum(pop_5year), Sum_single = sum(pop_single)) ``` ## Indices for Old Ages For data quality assessment at older ages, specialized indices are available: ```{r old-age-indices} # Create old-age data with heaping set.seed(42) old_ages <- c(sample(85:105, 3000, replace = TRUE), rep(c(90, 95, 100), each = 200)) # heaping at round ages # Coale-Li index (designed for ages 60+) coale_li(old_ages, digit = 0, ageMin = 85) # Jdanov index (for very old ages like 95, 100, 105) jdanov(old_ages, Agei = c(95, 100, 105)) # Kannisto index (for a single old age) kannisto(old_ages, Agei = 90) kannisto(old_ages, Agei = 95) ``` ## Summary The **heaping** package provides a complete toolkit for handling heaping in survey data: | Function | Purpose | |----------|---------| | `whipple()` | Standard and modified Whipple index | | `myers()` | Myers' blended index | | `bachi()` | Bachi's index | | `noumbissi()` | Single-digit Noumbissi index | | `spoorenberg()` | Total Modified Whipple index | | `coale_li()` | Coale-Li index for old ages | | `jdanov()` | Jdanov index for very old ages | | `kannisto()` | Kannisto index for single old age | | `heaping_indices()` | All common indices at once | | `correctHeaps()` | Correct regular heaping patterns | | `correctSingleHeap()` | Correct a single heap | | `sprague()` | Disaggregate 5-year to single-year ages | Key advantages over traditional methods: 1. **Preserves individual records** for regression and machine learning 2. **Multiple correction methods** (log-normal, normal, uniform, kernel) 3. **Model-based correction** preserves covariate relationships 4. **Supports sampling weights** for survey data 5. **Multiple imputation compatible** for uncertainty quantification ## References - Myers, R. J. (1940). Errors and bias in the reporting of ages in census data. *Transactions of the Actuarial Society of America*, 41, 395-415. - Spoorenberg, T. and Dutreuilh, C. (2007). Quality of age reporting: extension and application of the modified Whipple's index. *Population*, 62(4), 729-741. - Sprague, T. B. (1880). Explanation of a new formula for interpolation. *Journal of the Institute of Actuaries*, 22, 270-285.