---
title: "Background"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{background}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

## Assumptions about missingness

Three assumptions are commonly made about the process by which data become missing \[1]:

1. *Missing completely at random* (MCAR)
2. *Missing at random* (MAR)
3. *Missing not at random* (MNAR)

## Probabilistic interpretation

The process by which data become missing is treated as *random*, so
missing data can be formalised in terms of probability distributions.

## Mathematical formalism

The following unifies the formalisms in \[1], \[2], and \[3].

- $D$ : $n \times p$ data matrix
  - $D = (D_{obs}, D_{mis})$
- $D_{obs}$ : observed values of $D$
- $D_{mis}$ : unobserved values of $D$
- $M$ : $n \times p$ missingness indicator matrix, where $M_{ij} = 1$
if $D_{ij}$ is observed and $0$ otherwise
- $j$ : index of a distinct missing data pattern (a distinct row) in $M$, $j = 1, \dots, J$
- $J$ : total number of distinct missing data patterns
- $S_j$ : set of cases with missing data pattern $j$
- $m_j$ : number of cases in $S_j$
- $p_j$ : number of observed variables in $S_j$
- $\boldsymbol{\mu}$ : population mean vector
- $\Sigma$ : population covariance matrix
- $\hat{\boldsymbol{\mu}}$ : ML estimate of $\boldsymbol{\mu}$
- $\hat{\Sigma}$ : ML estimate of $\Sigma$
- $Q_j$ : $p \times p_j$ matrix indicating which variables are observed for pattern $j$
- $\hat{\boldsymbol{\mu}}_{obs,j}$ : subset of $\hat{\boldsymbol{\mu}}$ given by $\hat{\boldsymbol{\mu}}_{obs,j} \equiv \hat{\boldsymbol{\mu}}Q_j$
- $\bar{D}_{obs, j}$ : vector of sample means of observed variables in pattern $j$.
- $\hat{\Sigma}_{obs,j}$ : subset of $\hat{\Sigma}$ given by $\hat{\Sigma}_{obs,j} \equiv Q_j^{\top}\hat{\Sigma} Q_j$
- $\tilde{\Sigma}_{obs, j}$ : degrees-of-freedom-corrected version of $\hat{\Sigma}_{obs,j}$ given by $\tilde{\Sigma}_{obs, j} = \frac{n}{n-1}\hat{\Sigma}_{obs,j}$
- $d^2$ : statistic used to test MCAR \[2], where $d^2 = \sum_{j=1}^J m_j (\bar{D}_{obs, j} - \hat{\boldsymbol{\mu}}_{obs, j}) \tilde{\Sigma}_{obs, j}^{-1} (\bar{D}_{obs, j} - \hat{\boldsymbol{\mu}}_{obs, j})^{\top}$
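
The bookkeeping behind $M$, $S_j$, $m_j$, $p_j$, and $Q_j$ can be sketched in a few lines. The example below is illustrative only: it is written in plain Python rather than R, the toy matrix `D`, the helper names `selection_matrix` and `select`, and the values in `mu_hat` are all assumptions made for this sketch, and `mu_hat` is a placeholder rather than a real ML estimate.

```python
from collections import defaultdict

# Hypothetical 5 x 3 data matrix D; None marks an unobserved entry.
D = [
    [1.0, 2.0, 3.0],
    [4.0, None, 6.0],
    [7.0, None, 9.0],
    [None, 1.5, 2.5],
    [0.5, 2.5, 3.5],
]
n, p = len(D), len(D[0])

# M[i][k] = 1 if D[i][k] is observed and 0 otherwise.
M = [[0 if value is None else 1 for value in row] for row in D]

# S maps each distinct row of M (a missing data pattern) to the cases showing it.
S = defaultdict(list)
for i, row in enumerate(M):
    S[tuple(row)].append(i)

J = len(S)                                                  # number of patterns
m = {pattern: len(cases) for pattern, cases in S.items()}   # m_j
p_obs = {pattern: sum(pattern) for pattern in S}            # p_j


def selection_matrix(pattern):
    """Return the p x p_j matrix Q_j with one column per observed variable."""
    observed = [k for k, flag in enumerate(pattern) if flag == 1]
    return [[1 if k == col else 0 for col in observed]
            for k in range(len(pattern))]


def select(vector, Q):
    """Compute the row-vector product `vector Q`, keeping the observed entries."""
    return [sum(v * row[c] for v, row in zip(vector, Q))
            for c in range(len(Q[0]))]


mu_hat = [10.0, 20.0, 30.0]       # placeholder values, not real ML estimates
Q = selection_matrix((1, 0, 1))   # pattern with variables 1 and 3 observed
mu_obs = select(mu_hat, Q)        # corresponds to mu_hat Q_j
```

Here the rows of `D` fall into three patterns, so `J` is 3, and multiplying `mu_hat` by `Q` simply drops the entry for the unobserved second variable.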


The definitions of MCAR, MAR, and MNAR are based on the probability distribution of $M$.

  - **MCAR**
    - $P(M|D) = P(M)$
  - **MAR**
    - $P(M|D) = P(M|D_{obs})$
  - **MNAR**
    - $P(M|D)$ does not simplify: $M$ depends on $D_{mis}$ even after conditioning on $D_{obs}$
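
As a worked special case, assumed here purely for illustration and not taken from \[1], \[2], or \[3], let $D$ have two columns: $X$, which is always observed, and $Y$, which is sometimes missing. Write $M_Y$ for the indicator of $Y$ being observed. The three assumptions then constrain the probability that $Y$ is missing as follows, where $\pi$ is a constant and $g$ and $h$ are hypothetical functions, with $h$ genuinely varying in $Y$:

$$
\begin{aligned}
\text{MCAR:} \quad & P(M_Y = 0 \mid X, Y) = \pi \\
\text{MAR:} \quad & P(M_Y = 0 \mid X, Y) = g(X) \\
\text{MNAR:} \quad & P(M_Y = 0 \mid X, Y) = h(X, Y)
\end{aligned}
$$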

The above is summarised informally below \[1].

| Assumption | You can predict $M$ with: |
|--------|--------|
| MCAR | - |
| MAR | $D_{obs}$ |
| MNAR | $D_{obs}$ and $D_{mis}$ |
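
A small simulation can make the table concrete. The sketch below is an illustration under stated assumptions, not part of the cited formalisms: it is written in plain Python rather than R, the function names `simulate` and `missing_rate_gap`, the probabilities 0.3, 0.6, and 0.1, and the two-variable setup are all hypothetical choices. `x` plays the role of $D_{obs}$ and `y` the role of a variable that may fall into $D_{mis}$. Because the data are simulated, `y` is known even when it is flagged as missing, which is what allows the last comparison; with real data that comparison is not available.

```python
import random


def simulate(mechanism, n=20000, seed=1):
    """Draw (x, y, m) triples; x is always observed, m = 1 means y is observed."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)        # always observed
        y = x + rng.gauss(0.0, 1.0)    # may go missing
        if mechanism == "MCAR":
            p_missing = 0.3                      # depends on nothing
        elif mechanism == "MAR":
            p_missing = 0.6 if x > 0 else 0.1    # depends on observed x only
        elif mechanism == "MNAR":
            p_missing = 0.6 if y > 0 else 0.1    # depends on y itself
        else:
            raise ValueError(f"unknown mechanism: {mechanism}")
        m = 0 if rng.random() < p_missing else 1
        rows.append((x, y, m))
    return rows


def missing_rate_gap(rows, score):
    """Missing rate where score(x, y) > 0 minus the rate where it is <= 0."""
    high = [1 - m for x, y, m in rows if score(x, y) > 0]
    low = [1 - m for x, y, m in rows if score(x, y) <= 0]
    return sum(high) / len(high) - sum(low) / len(low)


for mechanism in ("MCAR", "MAR", "MNAR"):
    rows = simulate(mechanism)
    gap_x = missing_rate_gap(rows, lambda x, y: x)          # uses x only
    gap_resid = missing_rate_gap(rows, lambda x, y: y - x)  # part of y beyond x
    print(f"{mechanism}: gap by x = {gap_x:+.2f}, gap by y - x = {gap_resid:+.2f}")
```

With these settings both gaps should be close to zero under MCAR, only the gap by `x` should be clearly non-zero under MAR, and both should be non-zero under MNAR, which mirrors the three rows of the table.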

## References

[1] King G, Honaker J, Joseph A, Scheve K. Analyzing Incomplete Political Science Data: An Alternative
Algorithm for Multiple Imputation. American Political Science Review. 2001;95(1):49-69.

[2] Little RJA. A Test of Missing Completely at Random for Multivariate Data with Missing Values. Journal
of the American Statistical Association. 1988;83(404):1198-202.

[3] Rubin DB. Inference and Missing Data. Biometrika. 1976;63(3):581-92.

[4] Ibrahim JG, Zhu H, Tang N. Model Selection Criteria for Missing-Data Problems Using the EM Algorithm.
Journal of the American Statistical Association. 2008;103(484):1648-58. PMID: 19693282.