--- title: "Background" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{background} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ## Assumptions about missingness There are three assumptions about the process by which data become missing \[1]. 1. *Missing completely at random* (MCAR) 2. *Missing at random* (MAR) 3. *Missing not at random* (MNAR) ## Probabilistic interpretation The process by which data become missing is *random*, and so missing data can be formalised from a probabilistic perspective. ## Mathematical formalism The following unifies the formalisms in \[1], \[2], and \[3]. - $D$ : $n \times p$ data matrix - $D = (D_{obs}, D_{mis})$ - $D_{obs}$ : observed values of $D$ - $D_{mis}$ : unobserved values of $D$ - $M$ : $n \times p$ missingness indicator matrix, where $M_{ij} = 1$ if $D_{ij}$ is observed and $0$ otherwise - $j$ : distinct missing data pattern in $M$ - $J$ : total $j$ - $S_j$ : set of cases with missing data pattern $j$ - $m_j$ : number of cases in $S_j$ - $p_j$ : number of observed variables in $S_j$ - $\boldsymbol{\mu}$ : population mean vector - $\Sigma$ : population covariance matrix - $\hat{\boldsymbol{\mu}}$ : ML estimate of $\boldsymbol{\mu}$ - $\hat{\Sigma}$ : ML estimate of $\Sigma$ - $Q_j$ : $p \times p_j$ matrix indicating which variables are observed for pattern $j$ - $\hat{\boldsymbol{\mu}}_{obs,j}$ : subset of $\hat{\boldsymbol{\mu}}$ given by $\hat{\boldsymbol{\mu}}_{obs,j} \equiv \hat{\boldsymbol{\mu}}Q_j$ - $\bar{D}_{obs, j}$ : vector of sample means of observed variables in pattern $j$. - $\hat{\Sigma}_{obs,j}$ : subset of $\hat{\Sigma}$ given by $\hat{\Sigma}_{obs,j} \equiv Q_j^{\top}\hat{\Sigma} Q_j$ - $\tilde{\Sigma}_{obs, j}$ : accounts for degrees of freedom in $\hat{\Sigma}_{obs,j}$ given by $\tilde{\Sigma}_{obs, j} = \frac{n}{n-1}\hat{\Sigma}_{obs,j}$ - $d^2$ : statistic used to test MCAR where $d^2 = \sum_{j=1}^J m_j (\bar{D}_{obs, j} - \hat{\boldsymbol{\mu}}_{obs, j}) \tilde{\Sigma}_{obs, j}^{-1} (\bar{D}_{obs, j} - \hat{\boldsymbol{\mu}}_{obs, j})^{\top}$ = The definitions of MCAR, MAR, and MNAR are based on the probability distribution of $M$. - **MCAR** - $P(M|D) = P(M)$ - **MAR** - $P(M|D) = P(M|D_{obs})$ - **MNAR** - $P(M|D) = P(M|D)$ The above is summarised informally below \[1]. | Assumption | You can predict $M$ with: | |--------|--------| | MCAR | - | | MAR | $D_{obs}$ | | MNAR | $D_{obs}$ and $D_{mis}$ | ## References [1] King G, Honaker J, Joseph A, Scheve K. Analyzing Incomplete Political Science Data: An Alternative Algorithm for Multiple Imputation. American Political Science Review. 2001 March. [2] Little RJA. A Test of Missing Completely at Random for Multivariate Data with Missing Values. Journal of the American Statistical Association. 1988;83(404):1198-202. [3] Rubin DB. Inference and Missing Data. Biometrika. 1976;63(3):581-92. [4] Joseph G Ibrahim HZ, Tang N. Model Selection Criteria for Missing-Data Problems Using the EM Algorithm. Journal of the American Statistical Asso- ciation. 2008;103(484):1648-58. PMID: 19693282.