--- title: "Nonparametric Exponential Tilting Theory" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{exptilt_nonparam_theory} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ## Overview This vignette explains the matrix-vectorized implementation of the **fully nonparametric** exponential tilting (EXPTILT) estimator, as described in **Appendix 2** of Riddles et al. (2016). This method is designed for **fully categorical data**, where the outcomes ($Y$), the response-model covariates ($X_1$), and the instrumental-variable covariates ($X_2$) are all discrete. Unlike the parametric method, which estimates the parameters $\phi$ of a response probability function $\pi(x, y; \phi)$, the nonparametric approach directly estimates the **nonresponse odds** for each stratum. This is achieved using an **Expectation-Maximization (EM) algorithm** to find the maximum likelihood estimates of these odds. The implementation (`exptilt_nonparam.data.frame`) assumes the input `data` is an **aggregated data frame**, where each row represents a unique stratum $x^* = (x_1, x_2)$ and contains the *counts* of respondents for each outcome $y^*$ and the *total count* of nonrespondents. ## Notation and Main Objects The implementation maps directly to the notation in Appendix 2. Let $x_1$ be the covariates for the response model and $x_2$ be the nonresponse instrumental variable. A full stratum is $x^* = (x_1, x_2)$, and a response-model-only stratum is $x_1$. The algorithm is built from a set of fixed matrices (computed once) and one matrix that is iteratively updated. ### Fixed Objects (Pre-computed) 1. **Respondent Counts $N_{y^*x^*}$** (code: `n_y_x_matrix`) * **Source:** `data[, outcome_cols]` * **Dimensions:** $(N_{\text{strata}} \times K_{\text{outcomes}})$, where $N_{\text{strata}}$ is the number of $(x_1, x_2)$ rows and $K_{\text{outcomes}}$ is the number of $Y$ categories. * **Definition:** The observed (weighted) count of respondents for stratum $x^*$ and outcome $y^*$. $$N_{y^*x^*} = \sum_{i \in A} d_i \delta_i I(y_i = y^*, x_i = x^*)$$ (Note: The implementation assumes $d_i=1$ unless the input `data` is pre-weighted). 2. **Nonrespondent Counts $M_{x^*}$** (code: `m_x_vec`) * **Source:** `data[, refusal_col]` * **Dimensions:** $(N_{\text{strata}} \times 1)$ * **Definition:** The observed (weighted) total count of nonrespondents for stratum $x^*$. $$M_{x^*} = m_{x^*}$$ 3. **Respondent Proportions $\hat{P}_{y^*|x^*}$** (code: `p_hat_matrix`) * **Source:** `n_y_x_matrix / rowSums(n_y_x_matrix)` * **Dimensions:** $(N_{\text{strata}} \times K_{\text{outcomes}})$ * **Definition:** The conditional probability (proportion) of observing outcome $y^*$ given stratum $x^*$, *among respondents*. This is a fixed, observed quantity used in the E-Step. $$\hat{p}_{y^*|x^*} = \frac{N_{y^*x^*}}{\sum_{y} N_{yx^*}} = pr(y=y^* | x_1=x_1^*, x_2=x_2^*, \delta=1)$$ 4. **Aggregated Respondent Counts $N_{y^*x_1^*}$** (code: `n_y_x1_matrix`) * **Source:** `aggregate(n_y_x_matrix ~ data$x1_key, ...)` * **Dimensions:** $(N_{x_1} \times K_{\text{outcomes}})$, where $N_{x_1}$ is the number of unique $x_1$ strata. * **Definition:** The observed (weighted) count of respondents for outcome $y^*$ in stratum $x_1$, summed over the instrument $x_2$. This is the denominator of the M-Step. $$N_{y^*x_1^*} = \sum_{x_2} N_{y^*(x_1^*, x_2)} = n_{y^*x_1^*}$$ ### Iterative Objects (The EM Algorithm) 1. **Odds Matrix $O^{(t)}(x_1, y)$** (code: `odds_matrix`) * **Dimensions:** $(N_{x_1} \times K_{\text{outcomes}})$ * **Definition:** The **parameter** being estimated. It represents the odds of nonresponse for a given $x_1$ stratum and outcome $y$. This is updated in the M-Step. * **Initialization (Step 0):** $O^{(0)}(x_1, y) = 1$ for all $(x_1, y)$. 2. **Expected Nonrespondent Counts $M_{y^*x^*}^{(t)}$** (code: `m_y_x_matrix`) * **Dimensions:** $(N_{\text{strata}} \times K_{\text{outcomes}})$ * **Definition:** The *expected* count of nonrespondents for stratum $x^*$ and outcome $y^*$, given the current odds $O^{(t)}$. This is computed in the E-Step. 3. **Aggregated Expected Nonrespondent Counts $M_{y^*x_1^*}^{(t)}$** (code: `m_y_x1_matrix`) * **Dimensions:** $(N_{x_1} \times K_{\text{outcomes}})$ * **Definition:** The *expected* count of nonrespondents for outcome $y^*$ in stratum $x_1$, summed over the instrument $x_2$. This is the numerator of the M-Step. ## The Expectation-Maximization (EM) Algorithm The function `exptilt_nonparam.data.frame` is a direct implementation of the EM algorithm in Appendix 2. The goal is to find the $\text{argmax} O(x_1, y)$ that maximizes the observed data likelihood, which is solved by iterating two steps. ### Step 1.1 (E-Step): Compute Expected Nonrespondent Counts The E-Step computes the expected breakdown of nonrespondents into outcome categories. It answers: "Given our current `odds_matrix` $O^{(t)}$, what is the expected count $M_{y^*x^*}^{(t)}$ for each full stratum $x^*$?" * **Formula:** $$M_{y^*x^*}^{(t)} = M_{x^*} \frac{\hat{p}_{y^*|x^*} O^{(t)}(x_1^*, y^*)}{\sum_{y} \hat{p}_{y|x^*} O^{(t)}(x_1^*, y)}$$ * **Implementation (`E-STEP` in code):** This is vectorized. The denominator is computed via `rowSums(p_hat_matrix * odds_joined_matrix)`. The full calculation is: `m_y_x_matrix <- m_x_vec * (p_hat_matrix * odds_joined_matrix) / denominator` ### Step 1.2 (M-Step): Update Odds Matrix The M-Step updates the nonresponse odds $O(x_1, y)$ using the expected counts from the E-Step. 1. **Aggregate Expected Counts:** First, the expected nonrespondent counts $M_{y^*x^*}^{(t)}$ are aggregated over the instrument $x_2$ to get the total expected nonrespondents $M_{y^*x_1^*}^{(t)}$ for each $(x_1, y)$ cell. * **Formula:** $M_{y^*x_1^*}^{(t)} = \sum_{x_2} M_{y^*(x_1^*, x_2)}^{(t)}$ * **Implementation:** `m_y_x1_matrix <- aggregate(m_y_x_matrix ~ data$x1_key, ...)` 2. **Update Odds:** The new odds $O^{(t+1)}$ is the simple ratio of the *total expected nonrespondents* to the *total observed respondents* for each $(x_1, y)$ cell. * **Formula:** $$O^{(t+1)}(x_1^*, y^*) = \frac{M_{y^*x_1^*}^{(t)}}{N_{y^*x_1^*}}$$ * **Implementation:** `odds_matrix <- m_y_x1_matrix / n_safe` ### Step 2: Convergence Steps 1.1 and 1.2 are repeated until the change in the `odds_matrix` (measured by the sum of absolute differences) falls below the tolerance `tol`. ## Final Estimates and Survey Weights ### Survey Weights The `exptilt_nonparam.data.frame` function assumes the input `data` is **already aggregated**. If a `survey.design` object is provided to `exptilt_nonparam.survey.design`, an adapter (not shown here) must first be used to create this aggregated table from the microdata. In that case, $N_{y^*x^*}$ and $M_{x^*}$ represent **weighted sums of counts** ($\sum d_i \dots$), not simple counts. The EM algorithm and all derived matrices (`p_hat_matrix`, `odds_matrix`) are then correctly computed based on these weighted inputs, following the logic of the paper. ### Final Adjusted Counts The final output `data_to_return` is the object of primary interest for analysis. It is constructed by: 1. Calculating the final expected nonrespondent counts $M_{y^*x^*}^{(\text{final})}$ using the converged `odds_matrix`. 2. Adding these expected counts to the original observed respondent counts $N_{y^*x^*}$. * **Definition:** `data_to_return[y^*, x^*] = N_{y^*x^*} + M_{y^*x^*}^{(\text{final})}` * **Implementation:** `data_to_return[, outcome_cols] <- n_y_x_matrix_ordered + m_y_x_matrix_ordered` This final adjusted table represents the completed dataset, where the "Refusal" counts have been redistributed across the outcome columns according to the NMAR model. ### Final Proportion $\hat{\theta}_j$ The final population proportion for an outcome $j$, $\hat{\theta}_j$, is the total (weighted) adjusted count for outcome $j$ divided by the total (weighted) population. This is calculated *from* the `data_to_return` object. * **Formula (Unweighted):** $$\hat{\theta}_j = \frac{\sum_{x^*} (N_{jx^*} + M_{jx^*}^{(\text{final})})}{\sum_y \sum_{x^*} (N_{yx^*} + M_{yx^*}^{(\text{final})})}$$ * **Implementation:** This is precisely what the "Adjusted Proportions" calculation in your notebook's Chunk 5 performs on the `data_to_return` object. This differs from the paper's Step 3 formula, which is an IPW (Inverse Probability Weighting) estimator. However, both methods are asymptotically equivalent, and the "add-and-sum" method (your `data_to_return`) is a more direct and intuitive application of the EM algorithm's goal.