--- title: "extrememix R package" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{overview} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` `extrememix` implements Bayesian estimation of extreme value mixture models, estimating the threshold over which a Generalized Pareto distribution can be assumed as well as high quantiles and other measures of interest in extreme value theory. ## Installation The package `extrememix` can be installed from GitHub using the command ```{r vign17,warning = FALSE, message = FALSE, eval = F} # install.packages("devtools") devtools::install_github("manueleleonelli/extrememix") ``` and loaded in R with ```{r vign16, message = F} library(extrememix) library(ggplot2) ``` ## An applied analysis We consider the `rainfall` dataset reporting the monthly maxima daily rainfall (in mm) recorded at the Retiro station in the city of Madrid between 1985 and 2020. The data consists of 414 monthly maxima since 18 months were discarded in which no rain was observed. ```{r vign15, out.width="50%", fig.align = "center"} data("rainfall") ggplot(data = data.frame(rainfall), aes(x=rainfall)) + geom_histogram(binwidth = 2*length(rainfall)^(-1/3)*IQR(rainfall), colour="black", fill="white") + theme_bw() ``` The data histogram shows that the maximum rainfall observed in a day is around 50mm. Although there are some extreme observations, the distribution appears to be right-bounded. We start our analysis fitting the GGPD model (gamma bulk/GPD tail) using the function `fggpd`. We specify the number of iterations, the burn-in and the thinning via the options `it`, `burn` and `thin`, respectively. The prior distribution, the starting values and the variances of the proposal distributions are automatically chosen, but these can be set by the user (see below). ```{r vign14,echo = F} data("rainfall_ggpd") ``` ```{r vign13, eval = F} rainfall_ggpd <- fggpd(rainfall, it = 50000, burn = 10000, thin = 40) rainfall_ggpd ``` The `print` method for the object `model1` gives us an overview of the estimation process, stating the model fitted, its log-likelihood, the posterior estimate of the shape parameter $\xi$ of the GPD and the probability that the distribution is right-unbounded. Additional details can be gathered using the `summary` function. ```{r vign12} summary(rainfall_ggpd) ``` The `summary` reports the posterior estimates as well as 95% posterior credibility intervals of the models' parameters. The threshold is estimated at 12.88 and the GPD is therefore estimated over a proportion of the data equal to `r length(rainfall[rainfall> 12.88])/length(rainfall)`. `extrememix` includes the function `check_convergence` which reports the traceplot and the auto-correlation plot for the estimated 0.99 quantile. It can be used as a quick check to ensure convergence of the estimation algorithm. Other R packages can be used for more in-depth analyses. According to the output, the estimation of the quantile is stable and therefore it is likely that the chain reached convergence. ```{r vign11,fig.align = "center", out.width = "50%"} check_convergence(rainfall_ggpd) ``` As an alternative model we consider the MGPD (mixture of gammas bulk/GPD tail) with 2 mixture components. It can be fitted using the function `fmgpd`, which needs as input also the number of components `k`. In this case we fully specify the model and the estimation procedure by also giving the starting values, the variances and the prior distribution. The starting values can be set creating a list with entries `xi`, `sigma`, `u`, `mu`, `eta` and `w`. The proposal variances can be set creating a list with entries `xi`, `sigma`, `u`, `mu` and `w` (for each mixture component the parameters $\mu$ and $\eta$ are sampled jointly). The prior distribution can be set creating a list with entries `u` (a vector with the mean and standard deviation of the prior normal distribution for $u$), `mu_mu` (a vector with the prior means of the Gamma distributions for $\mu$), `mu_eta` (a vector with the prior shapes of the Gamma distributions for $\mu$), `eta_mu` (a vector with the prior means of the Gamma distributions for $\eta$) and `eta_eta` (a vector with the prior shapes of the Gamma distributions for $\eta$). ```{r vign10,eval = F} start <- list(xi = 0.2, sigma = 5, u = quantile(rainfall,0.9), mu = c(4,10), eta = c(1,4), w = c(0.5,0.5)) var <- list(xi = 0.001, sigma = 1, u = 2, mu = c(0.1,0.1), w = 0.1) prior <- list(u = c(22,5), mu_mu = c(4,16), mu_eta = c(0.001,0.001), eta_mu = c(1,4), eta_eta = c(0.001,0.001)) rainfall_mgpd <- fmgpd(rainfall, k =2, it = 50000, burn = 10000, thin = 40, start = start, var = var, prior = prior) ``` ```{r vign9, echo = F} data("rainfall_mgpd") ``` The summary below shows that an MGPD model is not actually required since the estimate of the weight of one of the two components is zero. ```{r vign8} rainfall_mgpd summary(rainfall_mgpd) ``` Let's anyway check the convergence of the algorithm to ensure the estimation process went ok. ```{r vign7, out.width = "50%", fig.align = "center"} check_convergence(rainfall_mgpd) ``` Since the MGPD model has one weight equal to zero, a GGPD is recommended. We can anyway check that this is the case using model selection criteria. `extrememix` implements the AIC, AICc, BIC, DIC and WAIC criteria in the equally-named functions. ```{r vign6} rbind(c(BIC(rainfall_ggpd),BIC(rainfall_mgpd)),c(DIC(rainfall_ggpd),DIC(rainfall_mgpd)),c(WAIC(rainfall_ggpd),WAIC(rainfall_mgpd))) ``` For simplicity, here we considered three model selection criteria. BIC, DIC and WAIC all favor the GGPD. As already noticed in the literature, the use of WAIC is recommended and indeed it selects the GGPD model. We therefore next investigate the use of the GGPD model to assess rainfall in the city of Madrid. The `plot` method gives an overview of the model reporting the histogram of the distributions of $\xi$ and $\sigma$, a plot of the quantiles and a plot of the predictive distribution. ```{r vign5, fig.align = "center", out.width = "50%"} plot(rainfall_ggpd) ``` The predictive distribution can be further obtained using the function `pred`. The plot shows that the model gives a faithful description of the tail of the data. ```{r vign4, out.width = "50%",fig.align = "center"} pred(rainfall_ggpd) ``` In extreme value analysis there are many measures that are used to quantify risk: quantiles (implemented in `quant`), return levels (in `return_level`), Value-at-Risk (in `VaR`), Expected shortfall (in `ES`) and Tail VaR (in `TVaR`). For instance here we compute the return levels, i.e. the value that is expected to be equaled or exceeded on average once every interval of time (T). ```{r vign3, out.width = "50%",fig.align = "center"} return_level(rainfall_ggpd) plot(return_level(rainfall_ggpd)) ``` From the output we can see that we expect a rainfall of 30.42 mms to be equaled or exceeded every 20 months. The width of the credibility intervals can be chosen with the `cred` input and the values at which to compute the return levels can be chosen with `values`. As a different measure, we consider next the expected shortfall, defined as the expected value in the q% of the worst cases. For instance, the code below computes the 1% expected shortfall. ```{r vign2, out.width = "50%",fig.align = "center"} ES(rainfall_ggpd, values = 1) plot(ES(rainfall_ggpd, values = 1)) ``` In other words, conditional on observing a value above the 0.99 quantile, the expected rainfall is equal to 44.46mms. Since we selected a unique `values` the plotting method reports the posterior histogram of the estimated quantity. To conclude the analysis, we can further estimate what is the largest possible rainfall that could ever be observed in the city of Madrid, since we observed that $\xi$ was often estimated as negative. This can be done with the function `upper_bound`. ```{r vign1, out.width= "50%",fig.align = "center"} upper_bound(rainfall_ggpd) plot(upper_bound(rainfall_ggpd), xlim = c(20,400)) ``` The maximum rainfall that could be observed in Madrid is estimated as `r round(unname(median(upper_bound(rainfall_ggpd)$bound)),2)`. Furthermore, since in the posterior sample there are some values of $\xi$ which are positive, we have a non-zero probability that the distribution is right-unbounded. The limits of the histogram are set with the input `xlim`.