---
title: "Statistics and characteristics of the MLE"
output:
  rmarkdown::html_vignette:
    toc: true
vignette: >
  %\VignetteIndexEntry{Statistics and characteristics of the MLE}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>")
oldopts <- options(digits = 3, scipen = 999)
print_num <- function(x) print(sprintf("%.3f", x))
```
`algebraic.mle` is an R package that provides an algebra over
Maximum Likelihood Estimators (MLEs). These estimators possess
many desirable, well-defined statistical properties which the package
helps you manipulate and utilize.
## Installation
The R package `algebraic.mle` can be installed from GitHub by using the devtools package
in R:
```{r install, eval = FALSE}
install.packages("devtools")
devtools::install_github("queelius/algebraic.mle")
```
```{r setup, include = FALSE}
library(algebraic.dist)
library(algebraic.mle)
library(numDeriv)
library(MASS)
library(ggplot2)
library(tibble)
library(mvtnorm)
set.seed(1234)
```
## Normal distribution
We are going to use the classic Normal distribution to demonstrate how to use
`algebraic.mle`. We choose it for a few reasons:
1. It's well-understood, so we can compare our results to the known results.
2. It's a very common distribution, so it's useful to have a good understanding
of its properties.
3. The MLE is multivariate, so we can demonstrate how to use `algebraic.mle` for
multivariate distributions.
So, first, we define a simple MLE solver for the normal distribution.
```{r}
fit_normal <- function(data) {
  # MLE of the variance (note: the biased estimator, dividing by n)
  sigma <- function(data) {
    mean((data - mean(data))^2)
  }
  # log-likelihood of a normal with par = (mu, var)
  loglik <- function(par, data) {
    n <- length(data)
    -n / 2 * log(2 * pi * par[2]) - 1 / (2 * par[2]) *
      (sum(data^2) - 2 * par[1] * sum(data) + n * par[1]^2)
  }
  par.hat <- c(mu = mean(data), var = sigma(data))
  H <- numDeriv::hessian(func = loglik, x = par.hat, data = data)
  algebraic.mle::mle(
    theta.hat = par.hat,
    loglike = loglik(par.hat, data),
    score = numDeriv::grad(func = loglik, x = par.hat, data = data),
    sigma = MASS::ginv(-H),
    info = -H,
    obs = NULL,
    nobs = length(data),
    superclasses = c("mle_normal"))
}
```
As you can see, we return an `mle` object and give it a sub-class,
`mle_normal` (it is also a subclass of `mle` and of `algebraic.dist`'s `dist`),
so that we can specialize some of the methods for the MLE of the normal
distribution, e.g., `bias.mle_normal`, which we show later.
## Monte-carlo (MC) simulation of the sampling distribution of the MLE
Let's define `theta_samp_mc`, which performs a Monte Carlo simulation
of the sampling distribution of the MLE. It takes a sample size `n`, a true
parameter value `theta`, and a number of simulations `B` to run. It returns a matrix
with `B` rows and two columns: the first column is the MLE of the mean and the
second is the MLE of the variance.
```{r theta-samp-mc}
theta_samp_mc <- function(n, theta, B = 10000) {
  mu <- theta[1]
  var <- theta[2]
  mles <- matrix(NA, nrow = B, ncol = 2)
  for (i in 1:B) {
    d <- rnorm(n, mean = mu, sd = sqrt(var))
    mles[i, ] <- params(fit_normal(d))
  }
  colnames(mles) <- c("mu", "var")
  mles
}
```
```{r mc-samp, cache=TRUE}
# Set up the parameters of a simulation
set.seed(913254)
n <- 70
mu <- 1
var <- 1
B <- 1000
theta <- c(mu, var)
mles <- theta_samp_mc(n = n, theta = theta, B = B)
head(mles)
```
The matrix `mles` is a sample of MLEs from the sampling distribution of the MLE. It
is an *empirical distribution* of the MLE of $(\mu, \sigma^2)'$ computed from samples
of size $n$, where $X_i \sim N(\mu, \sigma^2)$ for $i=1,\ldots,n$.
This particular example is a Monte Carlo simulation of the sampling
distribution, since we simulate the sampling distribution by repeatedly
sampling from the population distribution and computing the MLE for each sample.
> In the bootstrap, we would *resample* from the sample, not the population, but with a
> large enough sample, the two will produce nearly identical results. See the
> bootstrap section for more details, where we'll compare the two.
For a sufficiently large number of simulations $B$, the empirical sampling
distribution should be very close to the true sampling distribution. We can plot the
empirical sampling distribution of the MLEs using the `plot` function on the
`mles` matrix.
```{r fig.width=4, fig.height=4, echo=FALSE, fig.align="center", fig.cap="Sampling distribution of the MLEs.", warning=FALSE, cache=TRUE}
tmp <- colMeans(mles)
# Contour plot of the sampling distribution `mles`, with the true parameter
# theta = (mu, var) marked in red and the mean of the MLEs marked in green.
ggplot(as_tibble(mles), aes(x = mu, y = var)) +
  labs(x = expression(mu), y = expression(sigma^2)) +
  theme_bw() +
  geom_hline(yintercept = var) +
  geom_vline(xintercept = mu) +
  geom_point(aes(x = tmp[1], y = tmp[2]), color = "green") +
  geom_point(aes(x = mu, y = var), color = "red") +
  stat_density2d(aes(fill = after_stat(level)), geom = "polygon", alpha = 0.2) +
  scale_fill_gradient(low = "blue", high = "red")
```
In `algebraic.dist`, we can use `empirical_dist` to represent an empirical
sampling distribution by giving it the sample of MLEs previously generated:
```{r}
theta.mc <- algebraic.dist::empirical_dist(mles)
```
In general, for any MLE, and assuming the regularity conditions hold,
the asymptotic sampling distribution of the MLE is normal with mean $\theta$ and
variance-covariance matrix $\Sigma = I^{-1}(\theta)/n$, where $I$ is the Fisher
information matrix for a single observation and $n$ is the sample size. However, in general:
1. We don't know when the asymptotic sampling distribution is a good approximation
to the true sampling distribution. In these cases, the empirical sampling
distribution may be used instead.
2. We may not be confident our implementation of the MLE is correct,
in which case the empirical sampling distribution can be used to check our
implementation.
3. The regularity conditions may not hold, in which case the asymptotic sampling
distribution may not be known. In these cases, the empirical sampling distribution
may be used instead.
With these caveats in mind, we compare some of the statistics of the empirical
sampling distribution of the MLE for the normal distribution and the asymptotic
sampling distribution.
Let's look at some basic parameters of the sampling distribution of the MLE for the
normal distribution. First, let's look at the mean:
```{r}
(mu.mc <- mean(theta.mc))
```
The mean looks pretty close to the true parameter vector
$$
\theta = (\mu = `r mu`, \sigma^2 = `r var`)'.
$$
We can actually compute any parameter, since `theta.mc` models the concept of a
distribution. In particular, it models a distribution in `algebraic.dist`, and thus
the API exposed by `algebraic.dist` is available to us. For instance, we can compute
various parameters of the sampling distribution of the MLE using the `expectation`
function:
```{r expectation-tests, cache = TRUE}
# should sum to 1
expectation(theta.mc, function(x) 1)
# mean
expectation(theta.mc, function(x) x)
# variance of (mu, var)
expectation(theta.mc, function(x) (x - mu.mc)^2)
# kurtosis of (mu, var)
expectation(theta.mc, function(x) (x - mu.mc)^4) /
expectation(theta.mc, function(x) (x - mu.mc)^2)^2
# skewness of mu and var -- should be (0, 0)
expectation(theta.mc, function(x) ((x - mu.mc) / theta)^3)
# covariance of (mu, var) -- should be around 0
expectation(theta.mc, function(x) (x[1] - mu.mc[1]) * (x[2] - mu.mc[2]))
```
We could use the mean and variance-covariance matrix to parameterize a multivariate
normal distribution (MVN), for instance, but we don't do that here.
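If we wanted to, a minimal sketch of that approximation might look like the
following (using `mvtnorm`, which is already loaded in the setup chunk):
```{r mvn-approx-sketch}
# MVN approximation of the sampling distribution, parameterized by the
# mean and variance-covariance of the empirical (MC) distribution
theta.mvn.mean <- mean(theta.mc)
theta.mvn.cov <- vcov(theta.mc)
# e.g., the approximate density of the sampling distribution at the true theta
mvtnorm::dmvnorm(theta, mean = theta.mvn.mean, sigma = theta.mvn.cov)
```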
## Bias
Bias is a measure of the systematic error of an estimator; it measures how far its
average value is from the true value being estimated. Formally, it is defined as
the difference between the expected value of the estimator and the true
value of the parameter, i.e.,
$$
\operatorname{Bias}(\hat\theta) = E_{\hat\theta}(\hat\theta) - \theta,
$$
where $E_{\hat\theta}$ denotes the expectation operator with respect to the
sampling distribution of $\hat\theta$. (Normally, we drop the subscript in the
expectation operator and write $E$ instead of $E_{\hat\theta}$ unless it's not
clear from context which expectation operator we are using.)
When the bias is zero, the estimator is *unbiased*, otherwise it is *biased*.
Analytically, the (finite-sample) bias of the MLE for the parameters of the normal
distribution is
$$
\operatorname{Bias}(\hat\theta) = \left(\begin{array}{c}
0 \\
-\frac{\sigma^2}{n}
\end{array}\right).
$$
Plugging in the true value of $\sigma^2 = `r var`$ and the sample size $n = `r n`$,
we get $(0, `r -var/n`)'$. We may also provide an appropriate implementation of the
`bias` method in `algebraic.mle` for `mle_normal` (the class we gave the object
returned by `fit_normal`):
```{r}
bias.mle_normal <- function(x, par = NULL, ...) {
  if (is.null(par)) {
    par <- params(x)
  }
  c(mu = 0, var = -(1 / nobs(x)) * par[2])
}
```
Now, let's compute the bias using this function, and compare it to an estimate of
the bias obtained from the empirical (Monte Carlo) sampling distribution:
```{r bias-1}
# first, we sample some data from the true distribution
data <- rnorm(n = n, mean = mu, sd = sqrt(var))
# now we fit it to the normal distribution
theta.hat <- fit_normal(data)
# now we compute the bias, first using the asymptotic theory
bias(theta.hat, theta)
# now using the empirical sampling distribution
expectation(theta.mc, function(x) x - theta) # mean(theta.mc) - theta
```
The asymptotic bias and the empirical bias are pretty close.
Let's see how the bias of the variance changes as the sample size increases.
```{r bias-code, cache = TRUE}
N <- 1000
ns <- seq(10, 500, 10)
bias_var <- numeric(length(ns))
j <- 1
for (n in ns) {
  vars <- numeric(N)
  for (i in 1:N) {
    d <- rnorm(n = n, mean = mu, sd = sqrt(var))
    fit <- fit_normal(d)
    vars[i] <- params(fit)[2]
  }
  bias_var[j] <- mean(vars) - var
  j <- j + 1
}
```
```{r bias-plot, echo=FALSE, fig.height=4, fig.width=6}
plot(ns, bias_var, type = "l",
xlab = "Sample size", ylab = "Bias(Var)")
# plot the asymptotic bias
lines(ns, -(1 / ns) * var, lty = 2)
```
### Variance-covariance matrix
The variance-covariance matrix is one of the more important statistical measures
of an estimator of a parameter vector. It quantifies both the variability of the
individual parameter estimates and how they co-vary with each other.
The variance-covariance matrix of a parameter vector $\theta = (\theta_1, \ldots, \theta_p)'$
is a $p \times p$ matrix defined as
$$
\operatorname{Var}(\hat\theta) = E_{\hat\theta}\!\bigl[(\hat\theta - E_{\hat\theta}(\hat\theta))
(\hat\theta - E_{\hat\theta}(\hat\theta))'\bigr].
$$
The $(i, j)$th element of the variance-covariance matrix is the covariance between
the $i$th and $j$th elements of the parameter vector, respectively $\theta_i$ and
$\theta_j$. Thus, the diagonal elements of the variance-covariance matrix are the
variances of the individual parameter estimates, and the off-diagonal elements are
the covariances between the parameter estimates.
```{r}
round(vcov(theta.hat), digits=3)
round(vcov(theta.mc), digits=3)
```
They look reasonably close, suggesting that at $n = `r nobs(theta.hat)`$, the asymptotic
sampling distribution is a good approximation to the "true" sampling distribution of the MLE.
### Confidence intervals
We can compute the CI of a parameter using the `confint` function:
```{r confint-1}
confint(theta.hat)
```
A very important measure of the accuracy of an estimator is its coverage
probability, which is the probability that the confidence interval for the
parameter estimate contains the true value of the parameter. If the coverage
probability of a $100(1-\alpha)\%$ confidence interval is $1-\alpha$, then the
confidence interval is said to be *well-calibrated*. If the coverage probability
is greater than $1-\alpha$, then the confidence interval is said to be
*conservative*; if the coverage probability is less than $1-\alpha$, then the
confidence interval is said to be *anti-conservative*.
We can estimate it by simulating a large number of samples from the population
distribution and computing the proportion of times the confidence interval
contains the true value of the parameter. We can do this for both the mean and
variance of the normal distribution.
```{r coverage-prob, cache = TRUE}
N <- 1000
ns <- c(seq(50, 200, 50), 300, 600, 1000)
coverage_prob <- matrix(NA, nrow = length(ns), ncol = 2)
j <- 1
for (n in ns) {
  count1 <- 0L
  count2 <- 0L
  for (i in 1:N) {
    d <- rnorm(n = n, mean = mu, sd = sqrt(var))
    fit <- fit_normal(d)
    ci <- confint(fit)
    if (ci[1, 1] <= mu && mu <= ci[1, 2]) {
      count1 <- count1 + 1
    }
    if (ci[2, 1] <= var && var <= ci[2, 2]) {
      count2 <- count2 + 1
    }
  }
  coverage_prob[j, 1] <- count1 / N
  coverage_prob[j, 2] <- count2 / N
  j <- j + 1
}
```
```{r cov-plot, echo = FALSE, fig.height = 4, fig.width = 5, cache = TRUE}
plot(ns, coverage_prob[, 2], type = "l",
     xlab = "Sample size", col = "blue", ylab = "Coverage")
lines(ns, coverage_prob[, 1], col = "green")
legend("bottomright", legend = c("mu", "var"), col = c("green", "blue"), lty = 1)
abline(h = 0.95, lty = 2)
```
We see that the coverage probability is close to the nominal coverage probability,
and converges to it as the sample size increases. This suggests that the
confidence intervals are well-calibrated.
## Mean squared error matrix
The mean squared error (MSE) of an estimator of a parameter vector $\theta$ is
defined as
$$
\operatorname{MSE}(\hat\theta) = E\bigl[(\hat\theta - \theta)(\hat\theta - \theta)'\bigr],
$$
where $\hat\theta - \theta$ is a column vector of differences between the
estimator and the true parameter and $(\hat\theta - \theta)'$ is a row vector of
the same differences, and we are performing a standard matrix multiplication
between the two vectors. The MSE is a measure of the average squared error of
the estimator. It is a function of the true parameter value $\theta$.
This MSE is a *matrix*. It is very similar to the variance-covariance matrix, which
is defined as
$$
\operatorname{Var}(\hat\theta) = E\bigl[(\hat\theta - E(\hat\theta))
(\hat\theta - E(\hat\theta))'\bigr],
$$
where we replace the true parameter $\theta$ with the expected value of the estimator
$\hat\theta$. If the estimator is unbiased, then $E(\hat\theta) = \theta$ and
$\operatorname{Var}(\hat\theta) = \operatorname{MSE}(\hat\theta)$.
We not only need to consider the estimation error for each parameter individually,
but also how these errors might relate to each other. For instance, it could be the
case that when we overestimate one parameter, we tend to underestimate another. This
kind of relationship between errors in estimating different parameters can be
captured by the off-diagonal elements of the MSE matrix, which represent the
covariances between errors.
The diagonal elements of the MSE represent the MSE of the individual
parameter estimators, e.g., the $i$th diagonal element represents
$\operatorname{MSE}(\hat\theta_i)$.
The *trace* of the MSE matrix, the sum of its diagonal elements, represents the total
MSE across all parameters. As a single summary statistic, it may be useful for
comparing different estimators; we compute it below for our example.
The MSE can be decomposed into two parts:
1. The *bias*, which is the difference between the expected value of the estimator
and the true parameter value, and
2. The *variance*, which is the variance of the estimator.
The MSE is then computed as the sum of the bias outer product and the
variance-covariance matrix:
$$
\operatorname{MSE}(\hat\theta) = \operatorname{Bias}(\hat\theta)\operatorname{Bias}(\hat\theta)'
+ \operatorname{Var}(\hat\theta).
$$
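As a quick numerical check of this decomposition, here is a sketch using the
analytic `bias.mle_normal` defined earlier and the asymptotic variance-covariance
matrix; it should agree, up to rounding, with the `mse()` output in the next chunk:
```{r mse-decomp-check}
b.hat <- bias(theta.hat, theta)
round(outer(b.hat, b.hat) + vcov(theta.hat), digits = 3)
```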
```{r mse-1}
mse.hat <- mse(theta.hat, theta)
mse.mc <- matrix(expectation(theta.mc,
function(x) (x - theta) %*% t(x - theta)), nrow = 2)
round(mse.hat, digits = 3)
round(mse.mc, digits = 3)
```
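As mentioned above, the trace gives a single-number summary of the total MSE
across all parameters:
```{r mse-trace}
sum(diag(mse.hat))
sum(diag(mse.mc))
```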
It's hard to distinguish the MSE matrices from the variance-covariance
matrices reported previously, which is not surprising, since the bias is relatively
small and so the MSE is dominated by the variance.
Let's take a closer look at the variance and MSE of the mean $\hat\mu$:
```{r}
# temporarily show more digits in the numbers/outputs for this code block
op <- options(digits = 12)
# mse(mu)
expectation(theta.mc, function(x) (x[1] - mu)^2)
# variance(mu)
(mu.var <- expectation(theta.mc, function(x) (x[1] - mean(theta.mc)[1])^2))
b <- expectation(theta.mc, function(x) x[1] - mu)
# mse = bias^2 + variance
b^2 + mu.var
options(op)
```
They are very close, since the bias is so small.
We should take the MSE from the Monte Carlo simulation as a sort of "true" MSE,
since it is computed from the empirical sampling distribution of the MLE. We expect
that as the sample size increases, the asymptotic MSE (`mse.hat`) and the
MC MSE (`mse.mc`) will converge to the same value. In fact, let's run a little
experiment to show this:
```{r mse-sample-size, cache = TRUE}
ns <- seq(25, 200, 25)
mses.mc <- matrix(NA, nrow = length(ns), ncol = 2)
mses.hat <- matrix(NA, nrow = length(ns), ncol = 2)
mses.hat.hat <- matrix(NA, nrow = length(ns), ncol = 2)
j <- 1
for (n in ns) {
  theta.n <- empirical_dist(theta_samp_mc(n = n, theta = theta, B = B))
  # use the MC sample of MLEs (via obs()) to compute the "true" MSE
  samples <- obs(theta.n)
  mse.mu.n <- mean((samples[, 1] - mu)^2)
  mse.var.n <- mean((samples[, 2] - var)^2)
  # use a local variable `d` so we do not clobber `data` from earlier
  d <- rnorm(n = n, mean = mu, sd = sqrt(var))
  fit <- fit_normal(d)
  mses.mc[j, ] <- c(mse.mu.n, mse.var.n)
  mses.hat[j, ] <- diag(mse(fit, theta))
  mses.hat.hat[j, ] <- diag(mse(fit))
  j <- j + 1
}
```
```{r mse-plots, echo = FALSE, fig.height = 4, fig.width = 6}
plot(ns, mses.mc[, 1], type = "l", col = rgb(0,0,1,0.5), ylab = "MSE(mu)", xlab = "Sample size") # blue with transparency
lines(ns, mses.hat[, 1], col = rgb(0,1,0,0.5)) # green with transparency
lines(ns, mses.hat.hat[, 1], col = rgb(1,0,0,0.5)) # red with transparency
legend("topright", legend = c("MC", "Asymptotic", "Asymptotic (estimate)"),
col = c(rgb(0,0,1,0.5), rgb(0,1,0,0.5), rgb(1,0,0,0.5)), lty = 1)
plot(ns, mses.mc[, 2], type = "l", col = rgb(0,0,1,0.5), ylab = "MSE(var)", xlab = "Sample size") # blue with transparency
lines(ns, mses.hat[, 2], col = rgb(0,1,0,0.5)) # green with transparency
lines(ns, mses.hat.hat[, 2], col = rgb(1,0,0,0.5)) # red with transparency
legend("topright", legend = c("MC", "Asymptotic", "Asymptotic (estimate)"),
col = c(rgb(0,0,1,0.5), rgb(0,1,0,0.5), rgb(1,0,0,0.5)), lty = 1)
```
These plots demonstrate that the asymptotic MSE is a good approximation to the
"true" MSE, which is the MSE computed from the empirical sampling distribution of
the MLE.
It's difficult to distinguish the estimated asymptotic MSE, where the true
parameter $\theta$ is not known, from the asymptotic MSE, where the true parameter
$\theta$ is known. This is because the bias is so small, and so the MSE is
dominated by the variance.
## Bootstrap of the sampling distribution of the MLE
Normally, we don't know the true data generating process (DGP) of the data we
observe. We only have a sample of data, and we want to use that sample to
estimate the parameters of some model that hopefully provides a good fit to the
DGP using maximum likelihood estimation.
Earlier, we simulated a sample of data from a normal distribution with mean 1 and
variance 1 and then used MLE on each sample to generate an empirical sampling
distribution of the MLE. This is called *Monte Carlo simulation*.
However, we can also use the sample of data we have to generate an empirical
sampling distribution of the MLE. This is called the *bootstrap*. The idea is that
the sample we have is itself a sample from the true DGP, so we can *resample* from
it (with replacement) and fit an MLE to each resample to generate a bootstrapped
empirical sampling distribution of the MLE.
```{r mle-boot, cache=TRUE}
# Bootstrap the MLE by resampling (with replacement) from the observed sample `data`.
library(boot)
theta.boot <- mle_boot(boot(
data = data,
statistic = function(x, ind) {
params(fit_normal(x[ind]))
},
R = B))
```
Let's compute some statistics:
```{r}
params(theta.boot)
confint(theta.boot)
```
Let's use these bootstrapped MLEs to construct an approximation of the empirical
sampling distribution:
```{r}
theta.b <- empirical_dist(theta.boot$t)
```
As before, let's compute some basic expectations of the bootstrapped sampling
distribution of the MLE and compare them to the previous results:
```{r expectation-tests-boot, cache = TRUE}
# should sum to 1
expectation(theta.b, function(x) 1)
# mean
(mu.b <- mean(theta.b))
# variance of (mu, var)
expectation(theta.b, function(x) (x - mu.b)^2)
# kurtosis of (mu, var)
expectation(theta.b, function(x) (x - mu.b)^4) /
expectation(theta.b, function(x) (x - mu.b)^2)^2
# skewness of mu and var -- should be (0, 0)
expectation(theta.b, function(x) ((x - mu.b) / theta)^3)
# covariance of (mu, var) -- should be around 0
expectation(theta.b, function(x) (x[1] - mu.b[1]) * (x[2] - mu.b[2]))
```
These are not too bad.
Let's compute the bias and compare it to the previous results:
```{r bias-2}
bias(theta.boot)
expectation(theta.mc, function(x) x - theta)
bias(theta.hat, theta)
```
We see that the bias estimated by `mle_boot` is not too bad. Note that the
bias is an expectation w.r.t. the sampling distribution of the MLE.
In general, we can have a better estimator if we use
$$
\hat\theta^* = \hat\theta - \operatorname{Bias}(\hat\theta),
$$
assuming the bias estimate is accurate. In this particular example the correction
may actually make the estimate worse, even though in theory the corrected estimator
has smaller bias. In practice, we should not put too much trust in the bias reported
by the bootstrap, except as evidence of whether our estimator is biased.
The analytic bias, `bias.mle_normal`, is more accurate, and will generally produce
estimators with less bias (although, by the bias-variance trade-off, possibly
more variance).
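As a sketch, a bias-corrected estimate based on the analytic (plug-in) bias would be:
```{r bias-correction-sketch}
# hat(theta)* = hat(theta) - Bias(hat(theta)), using the plug-in analytic bias
params(theta.hat) - bias(theta.hat)
```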
Let's compare the variance-covariance matrix of the bootstrapped sampling
distribution of the MLE to the "true" (Monte Carlo) sampling distribution
and the asymptotic sampling distribution:
```{r}
round(vcov(theta.b), digits = 3)
round(vcov(theta.mc), digits = 3)
round(vcov(theta.hat), digits = 3)
```
They are all pretty close. Let's estimate the coverage probability of the
bootstrapped CIs:
```{r coverage-prob-boot, cache = TRUE}
N <- 100
ns <- c(50, 100)
coverage_prob <- matrix(NA, nrow = length(ns), ncol = 2)
j <- 1
for (n in ns) {
  count1 <- 0L
  count2 <- 0L
  for (i in 1:N) {
    d <- rnorm(n = n, mean = mu, sd = sqrt(var))
    fit.boot <- mle_boot(boot(
      data = d,
      statistic = function(x, ind) {
        params(fit_normal(x[ind]))
      },
      R = 250))
    ci <- confint(fit.boot)
    if (ci[1, 1] <= mu && mu <= ci[1, 2]) {
      count1 <- count1 + 1
    }
    if (ci[2, 1] <= var && var <= ci[2, 2]) {
      count2 <- count2 + 1
    }
  }
  coverage_prob[j, 1] <- count1 / N
  coverage_prob[j, 2] <- count2 / N
  j <- j + 1
}
```
```{r cov-plot-boot, echo = FALSE, fig.height = 4, fig.width = 5, cache = TRUE}
plot(ns, coverage_prob[, 2], type = "l",
     xlab = "Sample size", col = "blue", ylab = "Coverage")
lines(ns, coverage_prob[, 1], col = "green")
legend("bottomright", legend = c("mu", "var"), col = c("green", "blue"), lty = 1)
abline(h = 0.95, lty = 2)
```
## Prediction intervals
Frequently, we are actually interested in predicting the outcome of the
random variable (or vector) whose parameters we are estimating.
We observed a sample $\mathcal{D} = \{X_i\}_{i=1}^n$ where $X_i \sim \operatorname{Normal}(\mu, \sigma^2)$
and $\theta = (\mu, \sigma^2)'$ is unknown. To estimate $\theta$, we solved for the MLE which,
asymptotically, is normally distributed with mean $\theta$ and a variance-covariance
matrix given by the inverse of the FIM (or, using the bootstrap, by estimating the
covariance of the sampling distribution of the bootstrapped MLEs).
We wish to model the uncertainty of a new observation, $\hat{X}_{n+1}|\mathcal{D}$. We do so
by considering both the uncertainty inherent to the normal distribution and
the uncertainty of our estimate $\hat\theta$.
In particular, we let $\hat{X}_{n+1}|\hat\theta \sim \operatorname{Normal}(\hat\mu, \hat\sigma^2)$
and $\hat\theta \sim \operatorname{MVN}(\theta, I^{-1}(\theta)/n)$.
Then the joint distribution of $\hat{X}_{n+1}$ and $\hat\theta$ has the pdf
given by
$$
f(x, \theta) = f_{\hat{X}_{n+1}|\hat\theta}(x|\theta) f_{\hat\theta}(\theta),
$$
and thus to find the marginal pdf $f(x)$ of $\hat{X}_{n+1}$, we marginalize over $\theta$, obtaining
$$
f(x) = \int_{\mathbb{R}^2} f_{\hat{X}_{n+1},\hat\theta}(x,\theta) \, d\theta.
$$
Given the information in the sample, the uncertainty in the new observation
is characterized by the distribution
$$
\hat{X}_{n+1} \sim f(x).
$$
It has greater variance than $X_{n+1}|\hat\theta$ because, as stated earlier, we do
not know $\theta$, we only have an uncertain estimate $\hat\theta$.
The `pred` function computes the prediction interval (PI) of the
distribution of $\hat{X}_{n+1}$ using Monte Carlo integration, i.e., it takes a
large number of draws from the joint distribution of $\hat{X}_{n+1}$ and $\hat\theta$
and then computes the empirical quantiles.
The function `pred` takes as arguments `x`, in this case an `mle` object,
and a sampler for the distribution of the random variable of interest, in this
case a wrapper around `rnorm` (the sampler for the normal distribution). The sampler
must be compatible with the parameter value of `x` (i.e., `params(x)`).
Here is how we compute the PI for $\hat{X}_{n+1}$:
```{r}
samp <- function(n, par) rnorm(n = n, mean = par[1], sd = sqrt(par[2]))
pred(x = theta.hat, samp = samp)
```
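To see roughly what `pred` is doing, here is a hand-rolled sketch of the same
Monte Carlo marginalization (assuming the asymptotic MVN approximation for
$\hat\theta$; the exact internals of `pred` may differ):
```{r pred-by-hand}
# draw theta from the asymptotic MVN of the MLE (symmetrize the covariance to
# guard against tiny numerical asymmetries), then draw X_{n+1} | theta
S <- (vcov(theta.hat) + t(vcov(theta.hat))) / 2
theta.draws <- mvtnorm::rmvnorm(10000, mean = params(theta.hat), sigma = S)
x.new <- rnorm(nrow(theta.draws), mean = theta.draws[, 1],
               sd = sqrt(pmax(theta.draws[, 2], 0)))
quantile(x.new, c(0.025, 0.5, 0.975))
```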
How does this compare to $X_{n+1}|\hat\theta$?
```{r}
par <- params(theta.hat)
mu.hat <- par[1]
var.hat <- par[2]
c(mu.hat, qnorm(c(.025,.975), mean = mu.hat, sd = sqrt(var.hat)))
```
We see that the 95% quantile interval for $X_{n+1}|\hat\theta$ is a bit
narrower than that for $\hat{X}_{n+1}$, which is what we expected. Of course, for
sufficiently large sample sizes, the two converge to the same quantiles.
## Weighted MLE: a weighted sum of maximum likelihood estimators
Since the variance-covariance matrix of an MLE is (asymptotically) the inverse of
the FIM that the MLE is defined with respect to, we can combine
multiple MLEs of $\theta$, each of which may be defined with respect to a
different kind of sample, to arrive at an MLE that incorporates the Fisher
information in all of those samples.
Consider $k$ mutually independent MLEs of parameter $\theta$,
$\hat\theta_1,\ldots,\hat\theta_k$, where $\hat\theta_j \sim N(\theta,I_j^{-1}(\theta))$.
Then, the MLE of $\theta$ that incorporates all of the information in
$\hat\theta_1,\ldots,\hat\theta_k$ is given by the inverse-variance
weighted mean,
$$
\hat\theta_w = \left(\sum_{j=1}^k I_j(\theta)\right)^{-1} \left(\sum_{j=1}^k I_j(\theta) \hat\theta_j\right),
$$
which, asymptotically, has an expected value of $\theta$ and a variance-covariance
of $\left(\sum_{j=1}^k I_j(\theta)\right)^{-1}$.
To evaluate the performance of the weighted MLE, we generate a sample of
$N = 100$ observations from $\mathcal{N}(\mu, \sigma^2)$ and compute the MLE
for the observed sample, denoted by $\hat\theta$.
We then divide the observed sample into $r = 5$ sub-samples, each of size
$N/r = 20$, and compute the MLE for each sub-sample, denoted by
$\hat\theta^{(1)},\ldots,\hat\theta^{(r)}$.
Finally, we take a weighted combination of these MLEs to form the weighted MLE,
denoted by $\hat\theta_w$:
```{r}
N <- 100
r <- 5
samp <- rnorm(N, mean = theta[1], sd = sqrt(theta[2]))
samp.sub <- matrix(samp, nrow = r)
mles.sub <- vector("list", r)
for (i in 1:r) {
  mles.sub[[i]] <- fit_normal(samp.sub[i, ])
}
mle.wt <- mle_weighted(mles.sub)
mle <- fit_normal(samp)
```
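To connect this back to the inverse-variance weighting formula above, here is a
minimal sketch that combines the sub-sample fits directly, using `solve(vcov(fit))`
as the information of each fit (assuming the estimated covariance matrices are
invertible); it should agree with `params(mle.wt)` below up to how `mle_weighted`
estimates the information internally:
```{r weighted-mle-by-hand}
# inverse-variance weighted mean of the sub-sample MLEs
infos <- lapply(mles.sub, function(fit) solve(vcov(fit)))
I.total <- Reduce(`+`, infos)
theta.w.manual <- solve(I.total,
  Reduce(`+`, Map(function(I, fit) I %*% params(fit), infos, mles.sub)))
drop(theta.w.manual)
```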
We show the results in the following R code.
First, we show the weighted MLE and its variance-covariance matrix:
```{r}
params(mle.wt)
vcov(mle.wt)
```
The MLE for the total sample and its variance-covariance matrix is:
```{r}
params(mle)
vcov(mle)
```
Unfortunately, $\hat\theta$ is a much better estimator of $\theta$ than
$\hat\theta_w$ here. In theory the two are asymptotically equivalent, but in
practice issues like numerical instability, or poorly estimated information
matrices from small sub-samples, can cause the weighted MLE to perform worse.
We are in fact using numerical differentiation to compute the FIM, which may be
a source of error. We can try to improve the accuracy of the FIM by computing it
analytically or by using a more accurate numerical approximation.
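For the normal distribution the expected Fisher information has a closed form, so
one option is a small analytic helper like the hypothetical `fim_normal` below,
which could replace the numerical Hessian in `fit_normal` (a sketch, with
$\theta = (\mu, \sigma^2)$):
```{r fim-analytic}
# Expected Fisher information for a Normal(mu, sigma^2) sample of size n:
#   I_n(theta) = n * diag(1 / sigma^2, 1 / (2 * sigma^4))
fim_normal <- function(par, n) {
  n * diag(c(1 / par[2], 1 / (2 * par[2]^2)))
}
# compare to the numerically estimated information used by fit_normal
round(fim_normal(params(mle), nobs(mle)), digits = 3)
```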
```{r, include = FALSE}
options(oldopts)
```