--- title: "Exponential Tilting Theory" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{EXPTILT: vectorized matrix method} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ## Overview This vignette explains the matrix-vectorized implementation of the exponential tilting (EXPTILT) estimator used in this package. The exposition focuses on concepts and linear-algebraic operations used in the implementation (functions such as `generate_conditional_density_matrix`, `generate_C_matrix`, `generate_Odds`, and the vectorized `s_function`), and mirrors the statistical equations that underlie the algorithm (see Riddles et al. (2016) for the theoretical background). The method operates on a dataset split into respondents and non-respondents. Let $$n = n_0 + n_1$$ where $n_1$ is the number of respondents (observed $Y$) and $n_0$ is the number of non-respondents (missing $Y$). The algorithm always treats these two groups separately: the conditional-density model is fit on respondents and used to impute support for non-respondents; the missingness model is fit using covariates that include the candidate $y$ values. We emphasize two distinct and disjoint sets of covariates throughout, and we require they have empty intersection: $$\mathcal{X}_{\text{outcome}} \cap \mathcal{X}_{\text{missingness}} = \varnothing.$$ - covariates_for_outcome (denote $X_{\text{out}}$): variables used to model the conditional density $f_1(y\mid X_{\text{out}})$ on respondents; - covariates_for_missingness (denote $X_{\text{miss}}$): variables used in the response probability $\pi(X_{\text{miss}},y;\phi)$ (this includes candidate $y$ values when evaluating the missing-data expectation). Distinguishing these sets clearly in notation and code is crucial: the conditional density model is fit using `covariates_for_outcome` on respondents only, while the missingness model uses `covariates_for_missingness` and (importantly) takes $y$ as an additional predictor when forming $\pi(\cdot;\phi)$. ## Notation and main objects Let: - $n$ be the total number of units; split them into respondents (observed $Y$) and non-respondents (missing $Y$). Denote indices with $i$ for non-respondents and $k$ (or $j$) for respondent outcomes used as support. - $\{y_j\}_{j=1}^{n_1}$ be the vector of observed outcome values (respondents). - $X^{\text{out}}$ denote the matrix of covariates_for_outcome for respondents. - $X^{\text{un}}$ denote the matrix of covariates_for_outcome for non-respondents. - $X^{\text{miss,obs}}$ and $X^{\text{miss,un}}$ denote the covariates_for_missingness for respondents and non-respondents, respectively (we shorten these to $X_{\text{miss}}^{\text{obs}}$ and $X_{\text{miss}}^{\text{un}}$ when needed). - $\pi(x_{\text{miss}},y;\phi)$ the response probability (no explicit link notation here): for a given covariate row used in the missingness model and a candidate $y$, $\pi(\cdot;\phi)$ denotes the probability of response under parameter $\phi$. We build three central matrices: 1. Conditional-density matrix $F$ (denoted in code as `f_matrix_nieobs`): - Size: $n_0 \times n_1$ where $n_0$ is the number of non-respondents and $n_1$ is the number of distinct respondent $y$ values used as support. - Entries: $$F_{ij} = f_1(y_j \mid x^{\text{un}}_i; \hat\gamma)$$\ where $\hat\gamma$ is the (estimated) parameter vector of the conditional-density model fit on respondents. 
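The construction of $F$, $C$, $O(\phi)$, and $W(\phi)$ can be sketched in a few lines of base R. The chunk below is a minimal toy illustration, assuming a normal conditional density and a logit link; every object in it (`y_obs`, `x_obs`, `x_un`, `z_obs`, `z_un`, `gamma0`, ...) is invented for illustration and is not one of the package's internals.

```r
## Toy construction of F, C, O(phi), and W(phi). Minimal sketch assuming a
## normal conditional density and a logit link; all names are illustrative.
set.seed(1)
n1 <- 6; n0 <- 4                  # respondents, non-respondents
y_obs <- rnorm(n1, mean = 2)      # observed outcomes = discrete support {y_j}
x_obs <- rnorm(n1)                # covariates_for_outcome, respondents
x_un  <- rnorm(n0)                # covariates_for_outcome, non-respondents
z_obs <- rnorm(n1)                # covariates_for_missingness, respondents
z_un  <- rnorm(n0)                # covariates_for_missingness, non-respondents

# Pretend gamma-hat = (gamma0, gamma1, sigma) was estimated on respondents
gamma0 <- 1; gamma1 <- 0.5; sigma <- 1

# F: n0 x n1 with F[i, j] = f_1(y_j | x_un[i]; gamma-hat)
F_mat <- outer(x_un, y_obs,
               function(x, y) dnorm(y, mean = gamma0 + gamma1 * x, sd = sigma))

# C: length-n1 vector of column sums of densities over respondent covariates
C_vec <- colSums(outer(x_obs, y_obs,
                       function(x, y) dnorm(y, mean = gamma0 + gamma1 * x, sd = sigma)))

# O(phi): odds (1 - pi)/pi via the separable linear predictor
# eta[i, j] = b0 + b_z * z_un[i] + b_y * y_obs[j]
phi <- c(-0.2, 0.3, 0.4)          # (b0, b_z, b_y), a trial value
eta <- phi[1] + outer(phi[2] * z_un, phi[3] * y_obs, `+`)
O_mat <- exp(-eta)                # (1 - pi)/pi = exp(-eta) under the logit link

# W(phi): divide each column of O * F by the scalar C_j
W_mat <- sweep(O_mat * F_mat, 2, C_vec, `/`)
```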
## The vectorized score S_2 and matrix algebra

Recall the mean-score approximation that leads to the estimating equation used to solve for $\phi$ (cf. equation (13) in the notes):
$$
S_2(\phi; \hat\phi^{(t)}, \hat\gamma) = \sum_{r=1}^n \Big[ \delta_r s(\phi;\delta_r, x^{\text{miss}}_r, x^{\text{out}}_r, y_r) + (1-\delta_r)\widetilde{E}_0\{ s(\phi;\delta, x^{\text{miss}}_r, x^{\text{out}}_r, Y) \mid x^{\text{out}}_r;\hat\phi^{(t)},\hat\gamma\} \Big] = 0.
$$

It is convenient to decompose this as the sum of observed and unobserved contributions:
$$
S_2(\phi) = S_{\text{obs}}(\phi) + S_{\text{un}}(\phi),
$$
where
$$
S_{\text{obs}}(\phi) = \sum_{r:\,\delta_r=1} s(\phi;\delta=1, x^{\text{miss}}_r, x^{\text{out}}_r, y_r)
$$
is the score contribution from respondents, and the missing-unit contribution is approximated by the discrete-support expectation
$$
S_{\text{un}}(\phi) \approx \sum_{i:\,\delta_i=0} \sum_{j=1}^{n_1} W_{ij}(\phi)\, s(\phi;\delta=0, x^{\text{miss}}_i, x^{\text{out}}_i, y_j).
$$
The remainder of this section explains how the double sum in $S_{\text{un}}(\phi)$ is computed via matrix operations.

The crucial observation for vectorization is that the inner conditional expectation for a non-respondent $i$ is approximated by a weighted finite sum over the respondent support $\{y_j\}$:
$$
\widetilde{E}_0\{ s(\phi;\delta, x^{\text{miss}}_i, x^{\text{out}}_i, Y) \mid x^{\text{out}}_i\} \approx \sum_{j=1}^{n_1} W_{ij}(\phi)\, s(\phi;\delta=0, x^{\text{miss}}_i, x^{\text{out}}_i, y_j).
$$
Stacking these expectations across all non-respondents gives a single matrix operation:

- For each pair $(i,j)$, evaluate the vector-valued score $s$ at $(\delta=0, x^{\text{miss}}_i, x^{\text{out}}_i, y_j)$. These can be collected quickly by exploiting the algebraic factorization of the score (see the `s_function` implementation): many parameter-specific components are separable in $i$ and $j$, which allows low-memory representations.
- Denote by $S^{(0)}_{ij}$ the $p$-dimensional score vector at $(i,j)$. Organize these so that for each parameter index $m$ we have an $n_0\times n_1$ matrix of values $[S^{(0)}]_{m}$ (parameter-wise maps).

Then the $m$-th component of the non-respondent contribution to the overall score vector is
$$
\left[ \sum_{i=1}^{n_0} \sum_{j=1}^{n_1} W_{ij}(\phi)\, S^{(0)}_{ij} \right]_m = \mathbf{1}_{n_0}^\top \big( W \circ [S^{(0)}]_{m} \big)\, \mathbf{1}_{n_1}, \qquad m = 1,\dots,p,
$$
where $\circ$ denotes elementwise multiplication and the pre- and post-multiplications by vectors of ones implement the double summation; depending on the parameter's factorization, this collapse is carried out as a row sum or a column sum.

In concrete computational terms:

- For parameter components that multiply per row (i.e. depend only on $x_i$ times a factor that is a function of $y_j$), we compute elementwise products between $W$ and a factor matrix, row-sum across $j$ to get an $n_0\times 1$ contribution per non-respondent, and then sum across $i$.
- For intercept-like or column-wise components, a column sum followed by weighted multiplication suffices.

In the implementation this reduces to a sequence of dense matrix operations and row/column sums rather than explicit loops over the expanded index set of length $n_0\times n_1$. This yields large speed and memory benefits for real datasets.

Concise vectorized S_2 recipe (conceptual):

1. Build $F$ (size $n_0\times n_1$) via `generate_conditional_density_matrix(model)`.
2. Build $C$ (size $n_1$) via `generate_C_matrix(model)` by summing the conditional densities over respondents.
3. For a candidate $\phi$ compute $O(\phi)$ (size $n_0\times n_1$) via `generate_Odds(model, phi)`.
4. Form $W(\phi)=\dfrac{O(\phi)\circ F}{\mathbf{1}_{n_0} C^\top}$ (i.e. divide each column of $O\circ F$ by the corresponding scalar $C_j$).
5. Compute the observed-score sum $S_{\text{obs}}(\phi)=\sum_{r:\,\delta_r=1} s(\phi;\delta=1,x^{\text{miss}}_r,x^{\text{out}}_r,y_r)$ (this is small: one score vector per respondent).
6. Compute the non-respondent expected score: use $W(\phi)$ and parameter-wise factor matrices derived from $s(\cdot)$ to compute
   $$S_{\text{un}}(\phi)=\sum_{i=1}^{n_0}\sum_{j=1}^{n_1} W_{ij}(\phi)\, s(\phi;\delta=0,x^{\text{miss}}_i,x^{\text{out}}_i,y_j),$$
   implemented via matrix multiplications and row/column sums.
7. Return the total score $S_2(\phi)=S_{\text{obs}}(\phi) + S_{\text{un}}(\phi)$.

The root-finder then searches for $\phi$ such that $S_2(\phi)=0$. In the package this is implemented by repeatedly forming $O(\phi)$ for the current candidate $\phi$, computing $W(\phi)$ and $S_2(\phi)$, and letting `nleqslv` perform the iteration.
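Continuing the toy example from above, the recipe can be written out end to end for the logit response model, whose unit-level score is $(\delta_r - \pi_r)$ times the predictor vector $(1, z_r, y_r)$. This sketch is not the package's `s_function`; it simply shows how the double sum collapses into elementwise products and row/column sums, reusing the toy objects defined earlier.

```r
## Vectorized S_2(phi) for the toy logit model; reuses y_obs, z_obs, z_un,
## F_mat, and C_vec from the previous chunk. Not the package's s_function.
S2 <- function(phi) {
  # Respondent contribution: delta = 1, score = (1 - pi_r) * (1, z_r, y_r)
  pi_obs <- plogis(phi[1] + phi[2] * z_obs + phi[3] * y_obs)
  S_obs  <- colSums((1 - pi_obs) * cbind(1, z_obs, y_obs))

  # Non-respondent contribution: delta = 0, score = -pi[i,j] * (1, z_i, y_j),
  # weighted by W[i,j](phi) and summed over the whole (i, j) grid
  eta  <- phi[1] + outer(phi[2] * z_un, phi[3] * y_obs, `+`)
  W    <- sweep(exp(-eta) * F_mat, 2, C_vec, `/`)  # odds * F, column-normalized
  A    <- -plogis(eta) * W                         # weight x score factor, elementwise
  S_un <- c(sum(A),                                # intercept component
            sum(rowSums(A) * z_un),                # z-component: factor depends on i only
            sum(colSums(A) * y_obs))               # y-component: factor depends on j only
  S_obs + S_un
}

S2(c(-0.2, 0.3, 0.4))   # evaluate the estimating equations at a trial phi
```

A root-finder can then be applied directly, e.g. `nleqslv::nleqslv(c(0, 0, 0), S2)`.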
## Mean estimation and survey weights

After the response-model parameters $\hat\phi$ are obtained, the package reports a mean estimate of the target outcome using an inverse-probability reweighting form. Let respondents be indexed by $j=1,\dots,n_1$, and denote by $\pi_j = \pi(x^{\text{miss}}_j,y_j;\hat\phi)$ the fitted response probabilities evaluated at each respondent (here we write $x^{\text{miss}}_j$ for the covariates used in the missingness model evaluated at respondent $j$). With design weights $w_j$ (by default $w_j\equiv 1$ for non-survey data), the point estimator computed in the implementation is
$$
\hat\mu = \frac{\sum_{j=1}^{n_1} w_j\, y_j / \pi_j}{\sum_{j=1}^{n_1} w_j / \pi_j}.\tag{mean-est}
$$

Notes on survey weights and $f_1$ fitting:

- The conditional-density fit used to build $F$ can incorporate respondent sampling weights when provided (the implementation passes respondent weights to the density-fitting routine). Thus $f_1(\cdot\mid x;\gamma)$ may be a weighted fit when `design_weights` or a survey `design` is supplied.
- The mean estimate (mean-est) also uses design weights $w_j$ in both numerator and denominator, as shown above. In code these are provided via `model$design_weights` and default to 1 for non-survey use.

In short, survey weights enter in two places: (i) the conditional-density estimation (if requested) and (ii) the final mean calculation, through the weighted IPW-type ratio in (mean-est).
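In the running toy example, (mean-est) is a one-liner; `phi_hat` below is a placeholder for the solver output and `w` plays the role of `model$design_weights`.

```r
## The ratio estimator (mean-est) at a fitted phi-hat; phi_hat here is a
## placeholder for the solver output, and w defaults to 1 (non-survey data)
phi_hat <- c(-0.2, 0.3, 0.4)
pi_hat  <- plogis(phi_hat[1] + phi_hat[2] * z_obs + phi_hat[3] * y_obs)
w       <- rep(1, n1)
mu_hat  <- sum(w * y_obs / pi_hat) / sum(w / pi_hat)
```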
## Arguments passed to `exptilt` (summary)

The `exptilt` / `exptilt.data.frame` user-facing function accepts a number of arguments that control model specification and computation; the most important are:

- `data`: a `data.frame` with outcome and covariates.
- `formula`: a partitioned formula of the form `y ~ aux1 + aux2 | miss1 + miss2`, where the left part (before `|`) lists `covariates_for_outcome` and the right part lists `covariates_for_missingness` (the package helper splits these automatically).
- `auxiliary_means`: (optional) population or target means for auxiliary variables used for scaling.
- `standardize`: logical; whether to standardize features before fitting.
- `prob_model_type`: character, either `"logit"` or `"probit"`, to select the response-probability family.
- `y_dens`: choice of conditional-density family; `"auto"`, `"normal"`, `"lognormal"`, or `"exponential"`.
- `variance_method`: `"delta"` or `"bootstrap"` for variance estimation.
- `bootstrap_reps`: number of bootstrap replications when the bootstrap is used.
- `control`: list of control parameters forwarded to the nonlinear solver (`nleqslv`).
- `stopping_threshold`: numeric; the sup-norm threshold for early stopping of the score (see the stopping criterion below).
- `on_failure`: behavior on failure (`"return"` or `"error"`).
- `supress_warnings`: logical to silence certain warnings.
- `design_weights`: optional vector of respondent design weights (or full-sample weights, which are subset internally).
- `survey_design`: optional `survey.design` object; when provided, some internal logic uses the survey path.
- `trace_level`: integer controlling verbosity.
- `sample_size`: integer for stratified subsampling used when data are large.
- `outcome_label`, `user_formula`: utility arguments used for presentation and bookkeeping.

These arguments appear in the `exptilt.data.frame` function signature and control how the matrices $F$, $C$, and $O$ are built and how the solver is run.
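To make the argument list concrete, a hypothetical call might look as follows. The data frame `mydata` and its columns are invented for illustration, and the exact argument order is not guaranteed; only argument names documented above are used.

```r
## Hypothetical call: mydata and its columns (y, aux1, aux2, miss1, miss2)
## are invented for illustration
fit <- exptilt(
  data               = mydata,
  formula            = y ~ aux1 + aux2 | miss1 + miss2,
  prob_model_type    = "logit",
  y_dens             = "normal",
  variance_method    = "bootstrap",
  bootstrap_reps     = 500,
  stopping_threshold = 1e-8
)
```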
## Connection to the EM viewpoint

The EM-like update displayed in the theoretical notes (equation (14)),
$$\hat\phi^{(t+1)} \leftarrow \text{solve } S_2(\phi \mid \hat\phi^{(t)},\hat\gamma)=0,$$
is exactly what the implementation achieves: for a fixed conditional-density estimate $\hat\gamma$ and a current $\hat\phi^{(t)}$, the expectations for the missing units are approximated by the discrete support on observed $y_j$ and the resulting equation for $\phi$ is solved (via root-finding). The heavy lifting is performed by the matrix calculus described earlier: constructing $F$, $C$, and $O$, then computing $W$ and multiplying by the parameter-wise score factors.

## Stopping criterion (maximum-norm)

Practical optimization requires a convergence criterion. The implementation uses the maximum absolute component of the vector-valued score as a stopping rule. Concretely, if the solver produces a score vector $S_2(\phi)$ with
$$\max_{m=1,\dots,p} |S_{2,m}(\phi)| < \varepsilon$$
for a user-specified `stopping_threshold` $\varepsilon$, the algorithm treats this as converged. In the code this is used as an early exit inside the target function passed to the nonlinear solver: when the score's sup-norm is below the threshold, a zero vector is returned to signal convergence and avoid unnecessary further computation. This choice matches the intuition that the root-finder should stop when the estimating equations are (componentwise) negligibly small.
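A sketch of this early exit, wrapped around the toy `S2` from earlier; `eps` plays the role of `stopping_threshold`, and the whole chunk is illustrative rather than the package's internal target function.

```r
## Early exit on the sup-norm of the score: when max |S2_m(phi)| < eps,
## return a zero vector so nleqslv treats the current phi as a root
target_fn <- function(phi, eps = 1e-8) {
  s <- S2(phi)                    # toy score from the earlier sketch
  if (max(abs(s)) < eps) rep(0, length(s)) else s
}
sol <- nleqslv::nleqslv(c(0, 0, 0), target_fn)
sol$x                             # phi-hat at convergence
```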
## Practical tutorial: from raw data to matrix operations (conceptual steps)

1. Fit the conditional density $f_1(y\mid x;\gamma)$ using respondents and `covariates_for_outcome`. This gives a function that evaluates $f_1(y\mid x)$ for any $y$ and any covariate row $x$ (used for both respondents and non-respondents).
2. Evaluate the conditional density at the Cartesian product of non-respondent covariates and observed respondent $y$ values to form $F$ (done by `generate_conditional_density_matrix`). This is the empirical imputation support. Think of rows as target non-respondents and columns as candidate respondent outcomes.
3. Evaluate the same conditional density at respondent covariates to form the column-normalizer $C$ (done by `generate_C_matrix`); this is simply the column sum of densities over respondents.
4. For each trial value $\phi$ of the response-model parameters, compute the odds matrix $O(\phi)$ using the separable linear predictor and link function (implemented efficiently via `outer()` in code). Combine $O$ with $F$ and normalize columns by $C$ to obtain $W(\phi)$.
5. Use the vectorized `s_function` to obtain parameter-specific factor matrices for the non-respondent imputed scores; multiply them (elementwise) by $W(\phi)$ and reduce (row/column sums) to compute the non-respondent contribution to $S_2(\phi)$.
6. Add the observed-respondent score and use a root-finder (e.g. `nleqslv`) to find $\phi$ with $S_2(\phi)=0$. The solver may use the maximum-norm stopping threshold described above to exit early.

## Why the vectorization matters (practical remarks)

- Memory: a naive expansion to an explicit dataset of size $n_0\times n_1$ would store duplicated covariate rows and blow up memory. The implementation exploits separability (intercept + $X^{\text{miss}}_i\beta_{\text{miss}}$ + $\beta_y y_j$) and vectorized R primitives (`outer`, matrix multiplications, column/row sums) to avoid large temporary allocations.
- Speed: elementwise operations on dense matrices plus BLAS-accelerated matrix multiplication are much faster than interpreted loops in R for typical dataset sizes.
- Clarity: organizing the logic as the three matrices $F$, $C$, and $O$, followed by elementwise combination and reductions, makes the relationship between the statistical approximation and the implementation transparent and easier to reason about.

## Closing notes and references

This vignette mapped the implementation functions to the math in the theoretical notes and showed how the EM-like mean-score step reduces to a small set of matrix operations. The implementation follows the ideas described in Riddles et al. (2016) for exponential tilting under NMAR: fit the conditional density on respondents, approximate the missing-data expectation by a finite sum over observed $y$ values, and solve the resulting estimating equations for the missingness-model parameters.

Riddles, M. K., Kim, J. K., and Im, J. (2016). "A propensity-score-adjustment method for nonignorable nonresponse." *Journal of Survey Statistics and Methodology*, 4(2), 215–245.