---
title: "Bayesian Model Averaging"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Bayesian Model Averaging}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup}
library(StanMoMo)
```

The `StanMoMo` package includes two methods for model selection and model averaging based on leave-future-out validation, called *stacking* and *pseudo-BMA*. First we briefly discuss why the standard Bayesian model averaging and leave-one-out cross-validation should be avoided for mortality forecasting. Second, we explain with some mathematical details how the methods of *stacking* and *pseudo-BMA* can be used for mortality forecasting based on future-out validation.


## Introduction 

Instead of choosing one model, model averaging stems from the idea that a combination of candidate models among a model list $\mathcal{M}=(M_1,\dots,M_K)$ may perform better than one single model. The standard Bayesian approach, called \textit{Bayesian model averaging} (BMA), consists in weighing each model by its posterior model evidence. This approach should be avoided for mortality forecasting for several reasons. Among them, BMA is very sensitive to prior choices and tends to select only one model asymptotically (see e.g. Fernandez et al. (2001)). Moreover, like the Bayes Information Criterion (BIC), BMA measures how well the model fits the past but not how well the model predicts the future.

As an alternative approach, different authors considered model selection and averaging based on prediction performance on hold-out data. For instance, Yao et al. (2018) proposed Bayesian model averaging approaches based on leave-one-out cross-validation (LOO-CV). While this method is valid as a measure of fit, LOO-CV is problematic for forecasting. 

Indeed, leaving out only one observation at a time will allow information from the future to influence predictions of the past (i.e., data from times $t + 1, t + 2, \dots,$ would inform predictions for time $t$). Instead, it is more appropriate to use leave-future-out validation. In our context of mortality forecasting, instead of leaving one point out, we leave the last $M$ years of data out and evaluate the prediction performance on these $M$ years. This approach proposed in Barigou et al. (2023) is further discussed in the next section.

## Bayesian model averaging via stacking and pseudo-BMA

In order to select the weights by validation, the data is split into a training set and a validation set as follows:

- $y_{1:N}=(d_{x,t},e_{x,t})$ for all $x$ and $t=t_1,\dots,t_N$ is the data of deaths and exposures for the first $N$ years which is used to fit the model. 
- $y_{N+1:N+M}=(d_{x,t},e_{x,t})$ for all $x$ and $t=t_{N+1},\dots,t_{N+M}$ is the data of deaths and exposures for the remaining $M$ years which is used to validate the model. 

After fitting the mortality model to $y_{1:N}$, we can obtain an empirical distribution of the future death rates $\mu_{x,t}$ for $t=t_{N+1},\dots,t_{N+M}$ based on the MCMC samples obtained via Stan. The *stacking* approach searches for the averaged model which maximizes the likelihood of the observed number of deaths on the validation set. In simple words, if the validation length is $M=10$, the stacking approach will obtain the best forecasting averaged model for the last 10 years of data. The **stacking** maximization problem can be expressed as
$$ 
\max _{w \in \mathcal{S}_{1}^{K}} \sum_{x=x_1}^{x_n}  \sum_{j=t_{N+1}}^{t_{N+M}} \log \sum_{k=1}^{K} w_{k} p\left(d_{x,j} \mid y_{1: N},M_k\right),
$$
where

- $\mathcal{S}_{1}^{K}=\left\{w \in[0,1]^{K}: \sum_{k=1}^{K} w_{k}=1\right\}$.
- $K$ is the number of models.
- $p\left(d_{x,j} \mid y_{1: N},M_k\right)$ is the posterior predictive density which can be estimated with the MCMC samples.

By construction, the averaged distribution maximizes the log likelihood of the observed number of deaths in the validation set among all distributions in the convex hull: $$\mathcal{C}=\left\{\sum_{k=1}^{K} w_{k} \times p\left(\cdot \mid M_{k}\right): \sum_{k} w_{k}=1, w_{k} \geq 0\right\}$$

As an alternative approach, we consider an AIC-type weighting using leave-future-out validation, called *pseudo-BMA*. This method is based on the expected log predictive density (elpd) which is an aggregate measure of fit which is in our mortality model given by
$$
\operatorname{elpd}^{k}= \sum_{x=x_1}^{x_n}  \sum_{j=t_{N+1}}^{t_{N+M}} \log p\left(d_{x,j} \mid y_{1: N},M_k\right).
$$
Typically, $\operatorname{elpd}^{k}$ will be higher if the model $M_k$ better predicts the observed deaths in the validation set. The **pseudo-BMA** weighting is then given by 
$$ 
w_{k}=\frac{\exp \left(\operatorname{elpd}^{k}\right)}{\sum_{k=1}^{K} \exp \left(\operatorname{elpd}^{k}\right)}.
$$
Hence, the pseudo-BMA finds the relative weights proportional to the elpd of each model. As discussed in our paper Barigou et al. (2021), while this method provides a simple weight mechanism, stacking tends to outperform pseudo-BMA.

## Implementation

The package includes the `mortality_weights` function which performs Bayesian model averaging via stacking and pseudo-BMA. First, the user fits the different models with an extra argument `validation` which represents the number of years of validation. In a second step, the user calls the `mortality_weights` function with the list of model fits as argument. 

For instance, if we want to apply the averaging methods to the LC, APC and RH models to French data with 10 years of validation for ages 50-90 and years 1970-2017, the weights can be obtained via

```{r,eval=FALSE}
#We extract deaths and exposures
ages.fit<-50:90
years.fit<-1970:2017
deathFR<-FRMaleData$Dxt[formatC(ages.fit),formatC(years.fit)]
exposureFR<-FRMaleData$Ext[formatC(ages.fit),formatC(years.fit)]
#We fit the three mortality models
fitLC=lc_stan(death = deathFR,exposure=exposureFR, forecast = 10, validation=10,family = "poisson",cores=4)
fitRH=rh_stan(death = deathFR,exposure=exposureFR, forecast = 10, validation=10,family = "poisson",cores=4)
fitAPC=apc_stan(death = deathFR,exposure=exposureFR, forecast = 10, validation=10,family = "poisson",cores=4)
#We compute the model weights
model_weights<-mortality_weights(list(fitLC,fitRH,fitAPC))
```

## References

Fernandez, C., Ley, E., & Steel, M. F. (2001). Benchmark priors for Bayesian model averaging. Journal of Econometrics, 100(2), 381-427.

Yao, Y., Vehtari, A., Simpson, D., & Gelman, A. (2018). Using stacking to average Bayesian predictive distributions (with discussion). Bayesian Analysis, 13(3), 917-1007.

Barigou, K., Goffard, P. O., Loisel, S., & Salhi, Y. (2023). Bayesian model averaging for mortality forecasting using leave-future-out validation. International Journal of Forecasting, 39(2), 674-690.