--- title: "Empirical Likelihood" description: | Jing Qin, Denis Leung, Jun Shao
Estimation With Survey Data Under Nonignorable Nonresponse or Informative Sampling
https://doi.org/10.1198/016214502753479338 output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Empirical Likelihood} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- # Overview This vignette demonstrates the empirical likelihood (EL) estimator for Not Missing at Random (NMAR) data in the NMAR package. The primary estimand is the full-data mean of the outcome \(Y\) under the QLS NMAR model. The method implements the estimator of Qin, Leung, and Shao (2002), using empirical likelihood weights that satisfy estimating equations for the response mechanism and (optionally) auxiliary moment constraints. For full derivations, the analytic Jacobian, and variance discussion (bootstrap), see the companion article "Empirical Likelihood Theory for NMAR". Key features: - Supports `data.frame` (IID) and `survey.design` objects via the same `nmar()` API. - Variance via bootstrap (IID resampling or survey replicate weights). - Optional standardization of predictors; weight trimming for robustness. - Rich S3 surface: `summary()`, `confint()`, `tidy()`, `glance()`, `weights()`, `fitted()`. ## Quick start - Specify the model with a two-sided formula: `Y_miss ~ auxiliaries | response_predictors`. - The response model always includes an intercept and the evaluated LHS outcome expression \(Y\) for respondents. - Variables on the first RHS (left of `|`) enter only the auxiliary constraints (not the response model). - Variables on the second RHS (right of `|`) enter only the response model (not the auxiliary constraints). - Variables on the outcome RHS (e.g., `X1 + X2`) are auxiliaries; supply their known population means via `auxiliary_means = c(X1 = ..., X2 = ...)`. - Alternatively, set `auxiliary_means = NULL` to estimate auxiliary means from the full input (unweighted for IID data; design-weighted for surveys), corresponding to the QLS case that uses \(\bar X\) when \(X\) is observed for all sampled units. - Predictors to the right of `|` enter only the response model (no auxiliary constraint) and do not need population means. - If you want a covariate to enter both the auxiliary constraints and the response model (as in the original QLS setup), include it on both sides, for example: `Y_miss ~ X | X`. - Choose the engine: `el_engine(...)`, e.g., `el_engine(auxiliary_means = c(X1 = 0), variance_method = "bootstrap", standardize = TRUE)`. - Fit: `nmar(formula = Y_miss ~ X1 + X2 | Z1 + Z2, data = df_or_design, engine = el_engine(...))`. - Inspect: `summary()`, `confint()`, `weights()`, `fitted()`, and `fit$diagnostics`. # Data-frame example (IID) We simulate an NMAR mechanism where the response probability depends on the unobserved outcome. ```{r setup, include=FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", message = FALSE, warning = FALSE ) ``` ```{r} set.seed(123) library(NMAR) N <- 500 X <- rnorm(N) eps <- rnorm(N) Y <- 2 + 0.5 * X + eps # NMAR response: depends on Y p <- plogis(-1.0 + 0.4 * scale(Y)[, 1]) R <- runif(N) < p dat <- data.frame(Y_miss = Y, X = X) dat$Y_miss[!R] <- NA_real_ ``` ```{r} engine <- el_engine(auxiliary_means = c(X = 0), variance_method = "none", standardize = TRUE) # Fit EL estimator (no variance for speed in vignette) fit <- nmar( formula = Y_miss ~ X, data = dat, engine = engine ) summary(fit) # For confidence intervals, use bootstrap variance (see example below). ``` Probit family (optional): ```{r} engine <- el_engine(auxiliary_means = c(X = 0), family = "probit", variance_method = "none", standardize = TRUE) fit_probit <- nmar( formula = Y_miss ~ X, engine = engine, data = dat ) summary(fit_probit) ``` Tidy/glance summaries (via the `generics` package): ```{r} generics::tidy(fit) generics::glance(fit) ``` Outputs and diagnostics at a glance (probability-scale weights sum to 1; population-scale weights sum to the analysis total \(N_\text{pop}\)): ```{r} head(weights(fit), 10) head(weights(fit, scale = "population"), 10) head(fitted(fit), 10) fit$diagnostics[c( "max_equation_residual", "jacobian_condition_number", "trimmed_fraction", "min_denominator", "fraction_small_denominators" )] ``` Bootstrap variance (keep reps small for speed). This example requires the optional future.apply package; the chunk is skipped if it is not installed: ```{r, eval = requireNamespace("future.apply", quietly = TRUE)} set.seed(123) engine <- el_engine( auxiliary_means = c(X = 0), variance_method = "bootstrap", bootstrap_reps = 10, standardize = TRUE ) fit_boot <- nmar( formula = Y_miss ~ X, engine = engine, data = dat ) se(fit_boot) confint(fit_boot) ``` # Respondents-only data (n_total) If you pass respondents-only data (the outcome contains no NA), provide the total sample size via `n_total` in the engine so the estimator can recover the population response rate: ```{r} set.seed(124) N <- 300 X <- rnorm(N); eps <- rnorm(N); Y <- 1.5 + 0.4 * X + eps p <- plogis(-0.5 + 0.4 * scale(Y)[, 1]) R <- runif(N) < p df_resp <- subset(data.frame(Y_miss = Y, X = X), R == 1) eng_resp <- el_engine(auxiliary_means = c(X = 0), variance_method = "none", n_total = N) fit_resp <- nmar(Y_miss ~ X, data = df_resp, engine = eng_resp) summary(fit_resp) ``` # Response-only predictors You can include predictors that enter only the response model (and are not constrained as auxiliaries) by placing them to the right of `|` in the formula. ```{r} set.seed(125) N <- 400 X <- rnorm(N) Z <- rnorm(N) Y <- 1 + 0.6 * X + 0.3 * Z + rnorm(N) p <- plogis(-0.6 + 0.5 * scale(Y)[, 1] + 0.4 * Z) R <- runif(N) < p df2 <- data.frame(Y_miss = Y, X = X, Z = Z) df2$Y_miss[!R] <- NA_real_ engine <- el_engine(auxiliary_means = c(X = 0), variance_method = "none", standardize = TRUE) # Use X as auxiliary (known population mean 0), and Z as response-only predictor fit_resp_only <- nmar( formula = Y_miss ~ X | Z, data = df2, engine = engine ) summary(fit_resp_only) ``` Auxiliary means and formulas: - Names of `auxiliary_means` must match the `model.matrix` columns generated by the outcome RHS (for numeric variables this typically equals the variable names; for factors it corresponds to the indicator columns). - When `standardize = TRUE`, the engine automatically transforms `auxiliary_means` to the standardized scale internally and reports coefficients on the original scale. - Response-only predictors (to the right of `|`) do not need auxiliary means. # Survey design example (optional) The estimator supports complex surveys via `survey::svydesign()`. This chunk runs only if the `survey` package is available. If the survey weights were normalized (for example, rescaled to sum to the sample size), pass an appropriate population total via `el_engine(n_total = ...)` so that population-scale EL weights are on the intended scale. ```{r, eval = requireNamespace("survey", quietly = TRUE)} suppressPackageStartupMessages(library(survey)) data(api, package = "survey") set.seed(42) apiclus1$api00_miss <- apiclus1$api00 ystd <- scale(apiclus1$api00)[, 1] prob <- plogis(-0.5 + 0.4 * ystd + 0.2 * scale(apiclus1$ell)[, 1]) miss <- runif(nrow(apiclus1)) > prob apiclus1$api00_miss[miss] <- NA_real_ dclus1 <- svydesign(id = ~dnum, weights = ~pw, data = apiclus1, fpc = ~fpc) # Let the engine infer auxiliary means from the full design (design-weighted). # Alternatively, you can supply known population means via auxiliary_means. engine <- el_engine(auxiliary_means = NULL, variance_method = "none", standardize = TRUE) fit_svy <- nmar( formula = api00_miss ~ ell | ell, data = dclus1, engine = engine ) summary(fit_svy) ``` # Practical guidance - Trimming: Use a finite `trim_cap` to improve robustness when large weights occur; prefer bootstrap variance when trimming. - Solver control: set `control = list(xtol = 1e-10, ftol = 1e-10, maxit = 200)` for tighter tolerances if needed. Globalization details are managed internally by `nleqslv`. - Standardization: `standardize = TRUE` typically improves numerical stability and comparability across predictors and auxiliary means. - Diagnostics: Inspect `fit$diagnostics` (Jacobian condition number, max equation residuals, trimming fraction) to assess numerical health and identification strength. - Response-only predictors: Variables to the right of `|` do not need to appear on the RHS of the outcome formula; they enter only the response model. Auxiliary means must be supplied only for variables on the outcome RHS. - Inconsistent auxiliaries: If provided auxiliary means are grossly inconsistent with the respondent support, the engine will issue a warning via auxiliary inconsistency diagnostics and the solver may fail or yield highly concentrated weights. Consider revisiting the constraints, relaxing them, or using `trim_cap` and bootstrap variance. Troubleshooting: - Extreme weights: set a finite `trim_cap`; prefer `variance_method = "bootstrap"` for SE. - Ill-conditioned Jacobian (large `fit$diagnostics$jacobian_condition_number`): prefer `variance_method = "bootstrap"`. You may also tighten solver tolerances via `control = list(xtol=..., ftol=..., maxit=...)`. - Convergence issues: check `fit$diagnostics$max_equation_residual`, rescale predictors (`standardize = TRUE`), or reduce the number of constraints. # Solver control and notes - Control example: increase iterations and tighten tolerances via `control` (passed to `nleqslv`): ```{r} ctrl <- list(maxit = 200, xtol = 1e-10, ftol = 1e-10) eng_ctrl <- el_engine(auxiliary_means = c(X = 0), variance_method = "none", control = ctrl) invisible(nmar(Y_miss ~ X, data = dat, engine = eng_ctrl)) ``` # References and further reading - Qin, J., Leung, D., and Shao, J. (2002). Estimation with survey data under nonignorable nonresponse or informative sampling. Journal of the American Statistical Association, 97(457), 193-200. doi:10.1198/016214502753479338 - Chen, J. and Sitter, R. R. (1999). A pseudo empirical likelihood approach to the effective use of auxiliary information in complex surveys. Statistica Sinica, 9, 385-406. - Wu, C. (2005). Algorithms and R codes for the pseudo empirical likelihood method in survey sampling. Survey Methodology, 31(2), 239-243. # Families and numerical stability - Family: `el_engine(family = "logit")` (default) or `family = "probit"`. - Probit stability: the response-model score is evaluated as the Mills ratio $\phi(\eta)/\Phi(\eta)$ in the log domain for numerical stability; the logit score simplifies to $1-\mathrm{plogis}(\eta)$. We also cap the linear predictor and clip probabilities used in ratios. - Theory mapping: see the companion article "Empirical Likelihood Theory for NMAR" for equations, Jacobian blocks, and variance details. ```{r} sessionInfo() ```