\documentclass[nojss]{jss}
%% need no \usepackage{Sweave.sty}
\usepackage{amsmath}

%\VignetteIndexEntry{LaplacesDemon Tutorial}
%\VignettePackage{LaplacesDemon}
%\VignetteDepends{LaplacesDemon}

\author{Statisticat, LLC}
\title{\includegraphics[height=1in,keepaspectratio]{LDlogo} \\ \pkg{LaplacesDemon}: A Complete Environment for Bayesian Inference within \proglang{R}}

%% for pretty printing and a nice hypersummary also set:
\Plainauthor{Statisticat} %% comma-separated
\Plaintitle{LaplacesDemon: A Complete Environment for Bayesian Inference within R} %% without formatting
\Shorttitle{LaplacesDemon} %% a short title (if necessary)

\Abstract{
\pkg{LaplacesDemon}, also referred to as LD, is a contributed \proglang{R} package for Bayesian inference, and is freely available at \url{https://web.archive.org/web/20141224051720/http://www.bayesian-inference.com/indexe}. The user may build any kind of probability model with a user-specified model function. The model may be updated with iterative quadrature, Laplace Approximation, MCMC, PMC, or variational Bayes. After updating, a variety of facilities are available, including MCMC diagnostics, posterior predictive checks, and validation. Hopefully, \pkg{LaplacesDemon} is generalizable and user-friendly for Bayesians, especially Laplacians.
}
\Keywords{Bayesian, Big Data, High Performance Computing, HPC, Importance Sampling, Iterative Quadrature, Laplace Approximation, LaplacesDemon, LaplacesDemonCpp, Markov chain Monte Carlo, MCMC, Metropolis, Optimization, Parallel, PMC, R, Rejection Sampling, Variational Bayes}
\Plainkeywords{bayesian, big data, high performance computing, hpc, importance sampling, iterative quadrature, laplace approximation, laplacesdemon, laplacesdemoncpp, markov chain monte carlo, mcmc, metropolis, optimization, parallel, pmc, r, rejection sampling, variational bayes}

%% publication information
%% NOTE: Typically, this can be left commented and will be filled out by the technical editor
%% \Volume{13}
%% \Issue{9}
%% \Month{September}
%% \Year{2011}
%% \Submitdate{2011-01-18}
%% \Acceptdate{2011-01-18}

\Address{
 Statisticat, LLC\\
 Hot Springs, SD\\
 E-mail: defunct\\
 URL: \url{https://web.archive.org/web/20141224051720/http://www.bayesian-inference.com/index}
}

%% for those who use Sweave please include the following line (with % symbols):
%% need no \usepackage{Sweave.sty}

\begin{document}
Bayesian inference is named after Reverend Thomas Bayes (1701-1761) for developing Bayes' theorem, which was published posthumously after his death \citep{bayes63}. This was the first instance of what would be called inverse probability\footnote{`Inverse probability' refers to assigning a probability distribution to an unobserved variable, and is in essence, probability in the opposite direction of the usual sense. Bayes' theorem has been referred to as ``the principle of inverse probability''. Terminology has changed, and the term `Bayesian probability' has displaced `inverse probability'. The adjective ``Bayesian'' was introduced by R. A. Fisher as a derogatory term.}.

Unaware of Bayes, Pierre-Simon Laplace (1749-1827) independently developed Bayes' theorem and first published his version in 1774, eleven years after Bayes, in one of Laplace's first major works \citep[p. 366--367]{laplace74}. In 1812, Laplace introduced a host of new ideas and mathematical techniques in his book, \emph{Theorie Analytique des Probabilites} \citep{laplace12}. Before Laplace, probability theory was solely concerned with developing a mathematical analysis of games of chance. Laplace applied probabilistic ideas to many scientific and practical problems. Although Laplace is not the father of probability, Laplace may be considered the father of the field of probability.

In 1814, Laplace published his ``Essai Philosophique sur les Probabilites'', which introduced a mathematical system of inductive reasoning based on probability \citep{laplace14}. In it, the Bayesian interpretation of probability was developed independently by Laplace, much more thoroughly than Bayes, so some ``Bayesians'' refer to Bayesian inference as Laplacian inference. This is a translation of a quote in the introduction to this work:

\begin{quote}
``We may regard the present state of the universe as the effect of its past and the cause of its future. An intellect which at a certain moment would know all forces that set nature in motion, and all positions of all items of which nature is composed, if this intellect were also vast enough to submit these data to analysis, it would embrace in a single formula the movements of the greatest bodies of the universe and those of the tiniest atom; for such an intellect nothing would be uncertain and the future just like the past would be present before its eyes'' \citep{laplace14}.
\end{quote}

The `intellect' has been referred to by future biographers as Laplace's Demon. In this quote, Laplace expresses his philosophical belief in hard determinism and his wish for a computational machine that is capable of estimating the universe.

This article is an introduction to an \proglang{R} \citep{rdct:r} package called \pkg{LaplacesDemon} \citep{r:laplacesdemon}, which was designed without consideration for hard determinism, but instead with a lofty goal toward facilitating high-dimensional Bayesian (or Laplacian) inference\footnote{Even though the \pkg{LaplacesDemon} package is dedicated to Bayesian inference, frequentist inference may be used instead with the same functions by omitting the prior distributions and maximizing the likelihood.}, posing as its own intellect that is capable of impressive analysis. The \pkg{LaplacesDemon} \proglang{R} package is often referred to as LD. This article guides the user through installation, data, specifying a model, initial values, updating a numerical approximation algorithm, summarizing and plotting output, posterior predictive checks, general suggestions, discusses independence and observability, high performance computing, covers details of the algorithms, and introduces \url{https://web.archive.org/web/20141224051720/http://www.bayesian-inference.com/index}.

Herein, it is assumed that the reader has basic familiarity with Bayesian inference, numerical approximation, and \proglang{R}. If any part of this assumption is violated, then suggested sources include the vignette entitled ``Bayesian Inference'' that comes with the \pkg{LaplacesDemon} package, \citet{robert07}, and \citet{crawley07}.

\section{Installation} \label{installation}

To obtain the \pkg{LaplacesDemon} package, simply download the source code from \url{https://web.archive.org/web/20141224051720/http://www.bayesian-inference.com/softwaredownload}, open \proglang{R}, and install the \pkg{LaplacesDemon} package from source:

%%\SweaveOpts{echo=TRUE,results=verbatim,fig=FALSE}
\begin{Scode}{eval=FALSE}
install.packages(pkgs="path/LaplacesDemon_ver.tar.gz", repos=NULL, type="source")
\end{Scode}

where \code{path} is a path to the zipped source code, and \code{_ver} is replaced with the latest version found in the name of the downloaded file.

A goal in developing the \pkg{LaplacesDemon} package was to minimize reliance on other packages or software. Therefore, the usual \code{dep=TRUE} argument does not need to be used, because the \pkg{LaplacesDemon} package does not depend on anything other than base \proglang{R} and its \pkg{parallel} package. \pkg{LaplacesDemonCpp} is an extension package that uses \proglang{C++}, and imports these packages: \pkg{parallel}, \pkg{Rcpp}, and \pkg{RcppArmadillo}. This tutorial introduces only \pkg{LaplacesDemon}, but the use of \pkg{LaplacesDemonCpp} is identical. Once installed, simply use the \code{library} or \code{require} function in \proglang{R} to activate the \pkg{LaplacesDemon} package and load its functions into memory:

\begin{Scode}
library(LaplacesDemon)
\end{Scode}

\section{Data} \label{data}

The \pkg{LaplacesDemon} package requires data that is specified in a list\footnote{Though most \proglang{R} functions use data in the form of a data frame, \pkg{LaplacesDemon} uses one or more numeric matrices in a list. It is much faster to process a numeric matrix than a data frame in iterative estimation.}. As an example, there is a data set called \code{demonsnacks} that is provided with the \pkg{LaplacesDemon} package. For no good reason, other than to provide an example, the log of \code{Calories} will be fit as an additive, linear function of the log of some of the remaining variables. Since an intercept will be included, a vector of 1's is inserted into design matrix \textbf{X}.

\begin{Scode}
data(demonsnacks)
N <- nrow(demonsnacks)
y <- log(demonsnacks$Calories)
X <- cbind(1, as.matrix(log(demonsnacks[,c(1,4,10)]+1)))
J <- ncol(X)
for (j in 2:J) {X[,j] <- CenterScale(X[,j])}
mon.names <- "LP"
parm.names <- as.parm.names(list(beta=rep(0,J), sigma=0))
pos.beta <- grep("beta", parm.names)
pos.sigma <- grep("sigma", parm.names)
PGF <- function(Data) {
     beta <- rnorm(Data$J)
     sigma <- runif(1)
     return(c(beta, sigma))
     }
MyData <- list(J=J, PGF=PGF, X=X, mon.names=mon.names, 
     parm.names=parm.names, pos.beta=pos.beta, pos.sigma=pos.sigma, y=y)
\end{Scode}

There are J=\Sexpr{J} independent variables (including the intercept), one for each column in design matrix \textbf{X}. However, there are \Sexpr{J+1} parameters, since the residual variance, $\sigma^2$, must be included as well. Each parameter must have a name specified in the vector \code{parm.names}, and parameter names must be included with the data. This is using a function called \code{as.parm.names}. Also, note that each predictor has been centered and scaled, as per \citet{gelman08}. A \code{CenterScale} function is provided to center and scale predictors\footnote{Centering and scaling a predictor is \code{x.cs <- (x - mean(x)) / (2*sd(x))}.}.

\code{PGF} is an optional, but highly recommended, user-specified function. PGF stands for Parameter-Generating Function, and is used by the \code{GIV} function, where GIV stands for Generating Initial Values. Although the \code{PGF} is not technically data, it is most conveniently placed in the list of data. When \code{PGF} is not specified and \code{GIV} is used, initial values are generated randomly without respect to prior distributions. To see why \code{PGF} was specified as it was, consider the following sections on specifying a model and initial values.

\section{Specifying a Model} \label{specification}

The \pkg{LaplacesDemon} package is capable of estimating any Bayesian model for which the likelihood is specified\footnote{Examples of more than 100 Bayesian models may be found in the ``Examples'' vignette that comes with the \pkg{LaplacesDemon} package. Likelihood-free estimation is also possible by approximating the likelihood, such as in Approximate Bayesian Computation (ABC).}. To use the \pkg{LaplacesDemon} package, the user must specify a model. Let's consider a linear regression model, which is often denoted as:

$$\textbf{y} \sim \mathcal{N}(\mu, \sigma^2)$$
$$\mu = \textbf{X}\beta$$

The dependent variable, $\textbf{y}$, is normally distributed according to expectation vector $\mu$ and scalar variance $\sigma^2$, and expectation vector $\mu$ is equal to the inner product of design matrix \textbf{X} and transposed parameter vector $\beta$.

For a Bayesian model, the notation for the residual variance, $\sigma^2$, has often been replaced with the inverse of the residual precision, $\tau^{-1}$. Here, $\sigma^2$ will be used. Prior probabilities are specified for $\beta$ and $\sigma$ (the standard deviation, rather than the variance):

$$\beta_j \sim \mathcal{N}(0, 1000), \quad j=1,\dots,J$$
$$\sigma \sim \mathcal{HC}(25)$$

Each of the $J$ $\beta$ parameters is assigned a vague\footnote{`Traditionally, a vague prior would be considered to be under the class of uninformative or non-informative priors. 'Non-informative' may be more widely used than 'uninformative', but here that is considered poor English, such as saying something is `non-correct' when there's a word for that \dots `incorrect'. In any case, uninformative priors do not actually exist \citep{irony97}, because all priors are informative in some way. These priors are being described here as vague, but not as uninformative.} prior probability distribution that is normally-distributed according to $\mu=0$ and $\sigma^2=1000$. The large variance or small precision indicates a lot of uncertainty about each $\beta$, and is hence a vague distribution. The residual standard deviation $\sigma$ is half-Cauchy-distributed according to its hyperparameter, scale=25. When exploring new prior distributions, the user is encouraged to use the \code{is.proper} function to check for prior propriety.

To specify a model, the user must create a function called \code{Model}. Here is an example for a linear regression model written in \proglang{R} code\footnote{A model specification function for the \pkg{LaplacesDemon} or \pkg{LaplacesDemonCpp} packages may be written and compiled in a faster language, such as in \proglang{C++} via the \pkg{Rcpp} package family.}:

\begin{Scode}
Model <- function(parm, Data)
     {
     ### Parameters
     beta <- parm[Data$pos.beta]
     sigma <- interval(parm[Data$pos.sigma], 1e-100, Inf)
     parm[Data$pos.sigma] <- sigma
     ### Log-Prior
     beta.prior <- dnormv(beta, 0, 1000, log=TRUE)
     sigma.prior <- dhalfcauchy(sigma, 25, log=TRUE)
     ### Log-Likelihood
     mu <- tcrossprod(beta, Data$X)
     LL <- sum(dnorm(Data$y, mu, sigma, log=TRUE))
     ### Log-Posterior
     LP <- LL + sum(beta.prior) + sigma.prior
     Modelout <- list(LP=LP, Dev=-2*LL, Monitor=LP,
          yhat=rnorm(length(mu), mu, sigma), parm=parm)
     return(Modelout)
     }
\end{Scode}

A numerical approximation algorithm iteratively maximizes the logarithm of the unnormalized joint posterior density as specified in this \code{Model} function. In Bayesian inference, the logarithm of the unnormalized joint posterior density is proportional to the sum of the log-likelihood and logarithm of the prior densities:

$$\log[p(\Theta|\textbf{y})] \propto \log[p(\textbf{y}|\Theta)] + \log[p(\Theta)]$$

where $\Theta$ is a set of parameters, $\textbf{y}$ is the data, $\propto$ means `proportional to'\footnote{For those unfamiliar with $\propto$, this symbol simply means that two quantities are proportional if they vary in such a way that one is a constant multiplier of the other. This is due to an unspecified constant of proportionality in the equation. Here, this can be treated as `equal to'.}, $p(\Theta|\textbf{y})$ is the joint posterior density, $p(\textbf{y}|\Theta)$ is the likelihood, and $p(\Theta)$ is the set of prior densities.

During each iteration in which a numerical approximation algorithm is maximizing the logarithm of the unnormalized joint posterior density, two arguments are passed to \code{Model}: \code{parm} and \code{Data}, where \code{parm} is short for the set of parameters, and \code{Data} is a list of data. These arguments are specified in the beginning of the function:

\code{Model <- function(parm, Data)} 

Then, the \code{Model} function is evaluated and the logarithm of the unnormalized joint posterior density is calculated as \code{LP}, and returned in a list called \code{Modelout}, along with the deviance (\code{Dev}), a vector (\code{Monitor}) of any variables desired to be monitored in addition to the parameters, $\textbf{y}^{rep}$ (\code{yhat}) or replicates of $\textbf{y}$, and the parameter vector \code{parm}. All arguments must be returned. Even if there is no desire to observe the deviance and any monitored variable, a scalar must be placed in the second position of the \code{Modelout} list, and at least one element of a vector for a monitored variable. This can be seen in the end of the function:

\code{LP <- LL + sum(beta.prior) + sigma.prior} \\
\code{Modelout <- list(LP=LP, Dev=-2*LL, Monitor=LP,} \\
\hspace*{0.27 in} \code{yhat=rnorm(length(mu), mu, sigma), parm=parm)} \\
\code{return(Modelout)}

The rest of the function specifies the parameters, log of the prior densities, and calculates the log-likelihood. Since design matrix \textbf{X} has J=\Sexpr{J} column vectors (including the intercept), there are \Sexpr{J} \code{beta} parameters and a \code{sigma} parameter for the residual standard deviation.

Since a vector of parameters called \code{parm} is passed to \code{Model}, the function needs to know which parameter is associated with which element of \code{parm}. For this, the vector \code{beta} is declared, and then each element of \code{beta} is populated with the value associated in the corresponding element of \code{parm}. Above, the \code{grep} function was used to populate \code{pos.beta} and \code{pos.sigma}, which indicate the positions of $\beta$ and $\sigma$. These positions are stored in the list of data, and used in the \code{Model} function to extract the appropriate parameters from vector \code{parm}:

\code{beta <- parm[Data$pos.beta} \\
\code{sigma <- interval(parm[Data$pos.sigma], 1e-100, Inf)} \\
\code{parm[Data$pos.sigma] <- sigma}

The $\sigma$ parameter must be positive-only, and so it is constrained to be positive in the \code{interval} function. The algorithm, outside of the \code{Model} function needs to be aware that $\sigma$ has been constrained, so the \code{parm} vector is updated with the constrained value.

The user does not have to constrain parameters in this way. For example, an alternative is to reparameterize to real values, such as with a logarithm, in this case. If the user does not constrain or reparameterize a parameter that is not on the real line, then the algorithm will be unaware, and probably attempt a value outside of realistic bounds, such as a negative standard deviation in this example.

To work with the log of the prior densities and according to the assigned names of the parameters and hyperparameters, they are specified as follows:

\code{beta.prior <- dnormv(beta, 0, 1000, log=TRUE)} \\
\code{sigma.prior <- dhalfcauchy(sigma, 25, log=TRUE)}

In the above example, the residual standard deviation \code{sigma} receives a half-Cauchy distributed prior of the form:

$$\sigma \sim \mathcal{HC}(25)$$

Finally, everything is put together to calculate \code{LP}, the logarithm of the unnormalized joint posterior density. The expectation vector \code{mu} is the inner product of the design matrix, \code{Data$X}, and the transpose of the vector \code{beta}. Expectation vector \code{mu}, vector \code{Data$y}, and scalar \code{sigma} are used to estimate the sum of the log-likelihoods, where:

$$\textbf{y} \sim \mathcal{N}(\mu, \sigma^2)$$

and as noted before, the logarithm of the unnormalized joint posterior density is:

$$\log[p(\Theta|\textbf{y})] \propto \log[p(\textbf{y}|\Theta)] + \log[p(\Theta)]$$

\code{mu <- tcrossprod(Data$X, t(beta))} \\
\code{LL <- sum(dnorm(Data$y, mu, sigma, log=TRUE)} \\
\code{LP <- LL + sum(beta.prior) + sigma.prior}

In retrospect, the \code{PGF} function was specified so that when the list of data is passed to it, it generates and returns an initial value for each of the \code{beta} parameters, as well as one for the \code{sigma} parameter.

Specifying the model in the \code{Model} function is the most involved aspect for the user of the \pkg{LaplacesDemon} package. But this package has been designed so it is also incredibly flexible, allowing a wide variety of Bayesian models to be specified.

\section{Initial Values} \label{initialvalues}

Each numerical approximation algorithm in the \pkg{LaplacesDemon} package requires a vector of initial values for the parameters. Each initial value is a starting point for the estimation of a parameter. In this example, there are \Sexpr{J+1} parameters. The order of the elements of the vector of initial values must match the order of the parameters associated with each element of \code{parm} passed to the \code{Model} function. With no prior knowledge, it is a good idea to randomize each initial value, such as with the \code{GIV} function (which stands for ``generate initial values'').

When all initial values are set to zero for MCMC, the \code{LaplacesDemon} function optimizes initial values using a spectral projected gradient algorithm in the \code{LaplaceApproximation} function. Laplace Approximation is asymptotic with respect to sample size, so it is inappropriate in this example with a sample size of \Sexpr{N} and \Sexpr{J+1} parameters. MCMC will not use Laplace Approximation when the sample size is not at least five times the number of parameters.

\begin{Scode}
Initial.Values <- c(rep(0,J), 1)
\end{Scode}

\section{Numerical Approximation} \label{numericalapproximation}

Compared to specifying the model in the \code{Model} function, updating a model is easy. Since pseudo-random numbers are involved, it's a good idea to set a `seed' for pseudo-random number generation, so results can be reproduced. Pick any number you like, but there's only one number appropriate for a demon\footnote{Demonic references are used only to add flavor to the software and its use, and in no way endorses beliefs in demons. This specific pseudo-random seed is often referred to, jokingly, as the `demon seed'.}:

\begin{Scode}
set.seed(666)
\end{Scode}

The \pkg{LaplacesDemon} package offers a wide variety of numerical approximation algorithms. Details may be found below in section \ref{details}, and also in the appropriate function documentation. If the user is new to Bayesian inference, then the best suggestion may be to consider Laplace Approximation with the \code{LaplaceApproximation} function when sufficient sample size is available, or MCMC with the \code{LaplacesDemon} function otherwise. This guideline is too simple, but serves as a place to start. For this example, the \code{LaplacesDemon} function will be used.

As with any \proglang{R} package, the user can learn about a function by using the \code{help} function and including the name of the desired function. To learn the details of the \pkg{LaplacesDemon} function, enter:

\begin{Scode}{eval=false}
help(LaplacesDemon) 
\end{Scode}

Here is one of many possible ways to begin:

\begin{Scode}{eval=false}
Fit <- LaplacesDemon(Model, Data=MyData, Initial.Values,
     Covar=NULL, Iterations=1000, Status=100, Thinning=1,
     Algorithm="AFSS", Specs=list(A=500, B=NULL, m=100, n=0, w=1))
\end{Scode}

In this example, an output object called \code{Fit} will be created as a result of using the \pkg{LaplacesDemon} function. \code{Fit} is an object of class \code{demonoid}, which means that since it has been assigned a customized class, other functions have been custom-designed to work with it. The above example specifies the AFSS algorithm for updating.

This example tells the \pkg{LaplacesDemon} function to maximize the first component in the list output from the user-specified \code{Model} function, given a data set called \code{Data}, and according to several settings.
\begin{itemize}
\item The \code{Initial.Values} argument requires a vector of initial values for the parameters.
\item The \code{Covar=NULL} argument indicates that a user-specified variance vector or covariance matrix has not been supplied. AFSS requires proposal covariance, and when not specified, will begin with a scaled identity matrix.
\item The \code{Iterations=1000} argument indicates that the \code{LaplacesDemon} function will update 1,000 times before completion.
\item The \code{Status=100} argument indicates that a status message will be printed to the \proglang{R} console every 100 iterations.
\item The \code{Thinning=1} argument indicates that only ever $K$th iteration will be retained in the output, and in this case, every iteration will be retained. See the \code{Thin} function for more information on thinning.
\item The \code{Algorithm} argument requires the abbreviated name of the MCMC algorithm in quotes.
\item Finally, the \code{Specs} argument contains specifications for each algorithm named in the \code{Algorithm} argument. The \code{AFSS} algorithm has several specifications. The \code{A} specification indicates at which iteration adaptation will stop, and it is arbitrarily set here so that it adapts for the first half, and is non-adaptive in the second half. The \code{B} specification is for blockwise sampling, which is not performed here. The \code{m} specification indicates the maximum number of steps when searching for the slice interval. The \code{n} specification is set to zero and indicates the number of previous adaptive iterations. The \code{w} specification is the step-size, which is adapted in this algorithm.
\end{itemize}

By running\footnote{This is ``turning the Bayesian crank'', as Dennis Lindley used to say.} the \code{LaplacesDemon} function, the following output was obtained:

\begin{Scode}
Fit <- LaplacesDemon(Model, Data=MyData, Initial.Values,
     Covar=NULL, Iterations=1000, Status=100, Thinning=1,
     Algorithm="AFSS", Specs=list(A=500, B=NULL, m=100, n=0, w=1))
\end{Scode}

\code{LaplacesDemon} finished quickly, though it had a small data set (N=\Sexpr{NROW(X)}), few parameters (K=\Sexpr{J+1}), and the model was very simple. The output object, \code{Fit}, was created as a list. As with any \proglang{R} object, use \code{str()} to examine its structure:

\begin{Scode}{eval=false}
str(Fit)
\end{Scode}

To access any of these values in the output object \code{Fit}, simply append a dollar sign and the name of the component. For example, here is how to access the observed acceptance rate:

\begin{Scode}
Fit$Acceptance.Rate
\end{Scode}

\subsection{Warnings} \label{warnings}

During updating in \code{LaplacesDemon}, warnings are converted to errors, and the proposal is rejected. Warnings may appear due to checks before updating, or summarizing after updating, but not during updating. If chains appear to have numerous rejections after trying a variety of samplers, then the model specification function may be producing warnings with certain configurations of parameters. If warnings continue to occur, then the priors or parameterization should be considered. An example is when a scale parameter for the posterior predictive distribution is allowed to be too small or large.

\section{Summarizing Output} \label{summarizingoutput}

The output object, \code{Fit}, has many components. The (copious) contents of \code{Fit} can be printed to the screen with the usual \proglang{R} functions:

\begin{Scode}{eval=false}
Fit
print(Fit)
\end{Scode}

While a user is welcome to continue this \proglang{R} convention, the \pkg{LaplacesDemon} package adds another feature below the \code{print} function output in the \code{Consort} function. But before describing the additional feature, the results are obtained as:

\begin{Scode}
Consort(Fit)
\end{Scode}

Several components are labeled as \code{NOT SHOWN HERE}, due to their size, such as the covariance matrix \code{Covar} or the stationary posterior samples \code{Posterior2}. As usual, these can be printed to the screen by appending a dollar sign, followed by the desired component, such as:

\begin{Scode}{eval=false}
Fit$Posterior2
\end{Scode}

Although a lot can be learned from the above output, notice that it completed \Sexpr{Fit$Iterations} iterations of \Sexpr{J+1} variables in \Sexpr{round(Fit$Minutes,2)} minutes. Of course this was fast, since there were only \Sexpr{NROW(X)} records, and the form of the specified model was simple.

In \proglang{R}, there is usually a \code{summary} function associated with each class of output object. The \code{summary} function usually summarizes the output. For example, with frequentist models, the \code{summary} function usually creates a table of parameter estimates, complete with p-values.

Since this is not a frequentist package, p-values are not part of any table with the \code{LaplacesDemon} function, and the marginal posterior distributions of the parameters and other variables have already been summarized in \code{Fit}, there is no point to have an associated \code{summary} function. Going one more step toward useability, the \code{Consort} function of \pkg{LaplacesDemon} allows the user to consort with Laplace's Demon about the output object.

The additional feature is a second section called \code{Demonic Suggestion}. The \code{Demonic Suggestion} is a very helpful section of output. When the \pkg{LaplacesDemon} package was developed initially in late 2010, there were not to my knowledge any tools of Bayesian inference that make suggestions to the user.

Before making its \code{Demonic Suggestion}, Laplace's Demon considers and presents five conditions: the algorithm, acceptance rate, Monte Carlo standard error (MCSE), effective sample size (ESS), and stationarity. In addition to these conditions, there are other suggested values, such as a recommended number of iterations or values for the \code{Periodicity} and \code{Status} arguments. The suggested value for \code{Status} is seeking to print a status message every minute when the expected time is longer than a minute, and is based on the time in minutes it took, the number of iterations, and the recommended number of iterations.

In the above output, Laplace's Demon is appeased. However, if any of these five conditions is unsatisfactory, then Laplace's Demon is not appeased, and suggests it should continue updating, and that the user should copy, paste, and execute its suggested \proglang{R} code. Here are the criteria it measures against. The final algorithm must be non-adaptive, so that the Markov property holds (this is covered in section \ref{markovchainproperties}). The acceptance rate of most algorithms is considered satisfactory if it is within the interval [15\%, 50\%]\footnote{While \citet{spiegelhalter03} recommend updating until the acceptance rate is within the interval [20\%, 40\%], and \citet{roberts01} suggest [10\%, 40\%], the interval recommended here is [15\%,50\%]. HMC and Refractive must be in the interval [60\%, 70\%].}. LMC or MALA must be in the interval [40\%, 80\%], and others (AFSS, AGG, ESS, GG, OHSS, SGLD, Slice, and UESS) have an acceptance rate of 100\%. For more information on acceptance rates, see the \code{AcceptanceRate} function. MCSE is considered satisfactory for each target distribution if it is less than 6.27\% of the standard deviation of the target distribution. This allows the true mean to be within 5\% of the area under a Gaussian distribution around the estimated mean. ESS is considered satisfactory for each target distribution if it is at least 100, which is usually enough to describe 95\% probability intervals. And finally, each variable must be estimated as stationary.

In this example, notice that all criteria have been met: MCSEs are sufficiently small, ESSs are sufficiently large, and all parameters were estimated to be stationary. Although the algorithm adapted in the first half, it was non-adaptive in the second half of the run, the Markov property holds, so let's look at some plots.

\section{Plotting Output} \label{plottingoutput}

The \pkg{LaplacesDemon} package has a \code{plot.demonoid} function to enable its own customized plots with \code{demonoid} objects. The variable \code{BurnIn} (below) may be left as it is so it will show only the stationary samples (samples that are no longer trending), or set equal to zero so that all samples can be plotted. In this case, only samples are considered that were generated while the algorithm was non-adaptive, so \code{BurnIn=500}.

The \code{plot} function also enables the user to specify whether or not the plots should be saved as a .pdf file, and allows the user to select the parameters to be plotted. For example, \code{Parms=c("beta[1]","beta[2]")} would plot only the first two regression effects, and \code{Parms=NULL} will plot everything.

\begin{Scode}
\end{Scode}
\begin{Scode}{eval=false}
plot(Fit, BurnIn=500, MyData, PDF=FALSE, Parms=NULL)
\end{Scode}

%% Control graphic size, default width=0.8
\setkeys{Gin}{width=0.5\textwidth}

\begin{figure}
\begin{center}
\begin{Scode}{label=fig1,fig=TRUE,echo=FALSE,width=6,height=6}
par(mfrow=c(3,3))
BurnIn <- 500
for (j in 1:3){
     plot((BurnIn+1):Fit$Thinned.Samples,
          Fit$Posterior1[(BurnIn+1):Fit$Thinned.Samples,j],
          type="l", xlab="Thinned Samples", ylab="Value",
          main=MyData$parm.names[j])
     panel.smooth((BurnIn+1):Fit$Thinned.Samples, 
          Fit$Posterior1[(BurnIn+1):Fit$Thinned.Samples,j], pch="")
     plot(density(Fit$Posterior1[(BurnIn+1):Fit$Thinned.Samples,j]),
          xlab="Value", main=MyData$parm.names[j])
     polygon(density(Fit$Posterior1[(BurnIn+1):Fit$Thinned.Samples,j]),
          col="black", border="black")
     abline(v=0, col="red", lty=2)
     z <- acf(Fit$Posterior1[(BurnIn+1):Fit$Thinned.Samples,j], plot=FALSE)
     se <- 1/sqrt(length(Fit$Posterior1[(BurnIn+1):Fit$Thinned.Samples,j]))
     plot(z$lag, z$acf, ylim=c(min(z$acf,-2*se),1), type="h",
          main=MyData$parm.names[j], xlab="Lag", ylab="Correlation")
     abline(h=(2*se), col="red", lty=2)
     abline(h=(-2*se), col="red", lty=2)
     }
\end{Scode}
\end{center}
\caption{Plots of Marginal Posterior Samples}
\end{figure}

\begin{figure}
\begin{center}
\begin{Scode}{label=fig2,fig=TRUE,echo=FALSE,width=6,height=6}
par(mfrow=c(3,3))
for (j in 4:5){
     plot((BurnIn+1):Fit$Thinned.Samples,
          Fit$Posterior1[(BurnIn+1):Fit$Thinned.Samples,j],
          type="l", xlab="Thinned Samples", ylab="Value",
          main=MyData$parm.names[j])
     panel.smooth((BurnIn+1):Fit$Thinned.Samples, 
          Fit$Posterior1[(BurnIn+1):Fit$Thinned.Samples,j], pch="")
     plot(density(Fit$Posterior1[(BurnIn+1):Fit$Thinned.Samples,j]),
          xlab="Value", main=MyData$parm.names[j])
     polygon(density(Fit$Posterior1[(BurnIn+1):Fit$Thinned.Samples,j]),
          col="black", border="black")
     abline(v=0, col="red", lty=2)
     z <- acf(Fit$Posterior1[(BurnIn+1):Fit$Thinned.Samples,j], plot=FALSE)
     se <- 1/sqrt(length(Fit$Posterior1[(BurnIn+1):Fit$Thinned.Samples,j]))
     plot(z$lag, z$acf, ylim=c(min(z$acf,-2*se),1), type="h",
          main=MyData$parm.names[j], xlab="Lag", ylab="Correlation")
     abline(h=(2*se), col="red", lty=2)
     abline(h=(-2*se), col="red", lty=2)
     }
plot((BurnIn+1):length(Fit$Deviance),
     Fit$Deviance[(BurnIn+1):length(Fit$Deviance)],
     type="l", xlab="Thinned Samples", ylab="Value", main="Deviance")
panel.smooth((BurnIn+1):length(Fit$Deviance), 
     Fit$Deviance[(BurnIn+1):length(Fit$Deviance)], pch="")
plot(density(Fit$Deviance[(BurnIn+1):length(Fit$Deviance)]),
     xlab="Value", main="Deviance")
polygon(density(Fit$Deviance[(BurnIn+1):length(Fit$Deviance)]),
     col="black", border="black")
abline(v=0, col="red", lty=2)
z <- acf(Fit$Deviance[(BurnIn+1):length(Fit$Deviance)], plot=FALSE)
se <- 1/sqrt(length(Fit$Deviance[(BurnIn+1):length(Fit$Deviance)]))
plot(z$lag, z$acf, ylim=c(min(z$acf,-2*se),1), type="h",
     main="Deviance", xlab="Lag", ylab="Correlation")
abline(h=(2*se), col="red", lty=2)
abline(h=(-2*se), col="red", lty=2)
\end{Scode}
\end{center}
\caption{Plots of Marginal Posterior Samples}
\end{figure}

\begin{figure}
\begin{center}
\begin{Scode}{label=fig3,fig=TRUE,echo=FALSE,width=6,height=4}
par(mfrow=c(2,3))
JJ <- NCOL(Fit$Monitor); nn <- NROW(Fit$Monitor)
for (j in 1:JJ){
     plot((BurnIn+1):nn, 
          Fit$Monitor[(BurnIn+1):nn,j],
          type="l", xlab="Thinned Samples", ylab="Value", 
          main=MyData$mon.names[j])
     panel.smooth((BurnIn+1):nn, 
          Fit$Monitor[(BurnIn+1):nn,j], pch="")
     plot(density(Fit$Monitor[(BurnIn+1):nn,j]), xlab="Value", 
          main=MyData$mon.names[j])
     polygon(density(Fit$Monitor[(BurnIn+1):nn,j]),
          col="black", border="black")
     abline(v=0, col="red", lty=2)
     z <- acf(Fit$Monitor[(BurnIn+1):nn,j], plot=FALSE)
     se <- 1/sqrt(length(Fit$Monitor[(BurnIn+1):nn,j]))
     plot(z$lag, z$acf, ylim=c(min(z$acf,-2*se),1), type="h",
          main=MyData$mon.names[j], xlab="Lag", ylab="Correlation")
     abline(h=(2*se), col="red", lty=2)
     abline(h=(-2*se), col="red", lty=2)
     }
\end{Scode}
\end{center}
\caption{Plots of Marginal Posterior Samples}
\end{figure}

There are three plots for each parameter, the deviance, and each monitored variable (which in this example are \code{LP} and \code{sigma}). The leftmost plot is a trace-plot, showing the history of the value of the parameter according to the iteration. The middlemost plot is a kernel density plot. The rightmost plot is an ACF or autocorrelation function plot, showing the autocorrelation at different lags. The chains look stationary (do not exhibit a trend), the kernel densities look Gaussian, and the ACF's show low autocorrelation.

The Hellinger distances between batches of chains can be plotted with

\begin{Scode}{eval=false}
plot(BMK.Diagnostic(Fit$Posterior1[501:1000,]))
\end{Scode}

\begin{figure}
\begin{center}
\begin{Scode}{label=fig4,fig=TRUE,echo=FALSE,width=6,height=6}
par(mfrow=c(1,1))
plot(BMK.Diagnostic(Fit$Posterior1[501:1000,]))
\end{Scode}
\end{center}
\caption{Hellinger Distances}
\end{figure}

These distances occur in the interval $[0,1]$, and lower (darker) is better. The \code{LaplacesDemon} function considers any Hellinger distance greater than 0.5 to indicate non-stationarity and non-convergence. This plot is useful for quickly finding problematic parts of chains. All Hellinger distances here are acceptably small (dark).

Another useful plot is called the caterpillar plot, which plots a horizontal representation of three quantiles (2.5\%, 50\%, and 97.5\%) of each selected parameter from the posterior samples summary. The caterpillar plot will attempt to plot the stationary samples first (\code{Fit$Summary2}), but if stationary samples do not exist, then it will plot all samples (\code{Fit$Summary1}). Here, only the first four parameters are selected for a caterpillar plot:

\begin{Scode}{eval=false}
caterpillar.plot(Fit, Parms="beta")
\end{Scode}

\begin{figure}
\begin{center}
\begin{Scode}{label=fig5,fig=TRUE,echo=FALSE,width=6,height=6}
par(mfrow=c(1,1))
caterpillar.plot(Fit, Parms=1:4)
\end{Scode}
\end{center}
\caption{Caterpillar Plot}
\end{figure}

If all is well, then the Markov chains should be studied with MCMC diagnostics (such as visual inspections with the \code{CSF} or Cumulative Sample Function), and finally, further assessments of model fit should be estimated with posterior predictive checks, showing how well (or poorly) the model fits the data. When the user is satisfied, the \code{BayesFactor} function may be useful in selecting the best model, and the marginal posterior samples may be used for inference.

\section{Posterior Predictive Checks} \label{ppc}

A posterior predictive check is a method to assess discrepancies between the model and the data \citep{gelman96a}. To perform posterior predictive checks with the the \pkg{LaplacesDemon} package, simply use the \code{predict} function:

\begin{Scode}
Pred <- predict(Fit, Model, MyData, CPUs=1)
\end{Scode}

This creates \code{Pred}, which is an object of class \code{demonoid.ppc} (where ppc is short for posterior predictive check). \code{Pred} is a list that contains three components: \code{y}, \code{yhat}, and \code{Deviance} (though the \code{LaplaceApproximation} output differs a little). If the data set that was used to estimate the model is supplied in \code{predict}, then replicates of \code{y} (also called $\textbf{y}^{rep}$) are estimated. If, instead, a new data set is supplied in \code{predict}, then new, unobserved instances of \code{y} (called $\textbf{y}^{new}$) are estimated. Note that with new data, a \code{y} vector must still be supplied, and if unknown, can be set to something sensible such as the mean of the \code{y} vector in the model.

The \code{predict} function calls the \code{Model} function once for each set of stationary samples in \code{Fit\$Posterior2}. When there are few discrepancies between \code{y} and $\textbf{y}^{rep}$, the model is considered to fit well to the data. Parallel processing is enabled when multiple CPUs exist and are specified.

Since \code{Pred\$yhat} is a large (39 x 1000) matrix, let's look at the summary of the posterior predictive distribution:

\begin{Scode}
summary(Pred, Discrep="Chi-Square")
\end{Scode}

The \code{summary.demonoid.ppc} function returns a list with 5 components when \code{y} is continuous (different output occurs for categorical dependent variables when given the argument \code{Categorical=TRUE}):
\begin{itemize}
\item \code{BPIC} is the Bayesian Predictive Information Criterion of \citet{ando07}. BPIC is a variation of the Deviance Information Criterion (DIC) that has been modified for predictive distributions. For more information on DIC, see the accompanying vignette entitled ``Bayesian Inference''.
\item \code{Concordance} is the predictive concordance of \citet{gelfand96}, that indicates the percentage of times that \code{y} was within the 95\% probability interval of \code{yhat}. A goal is to have 95\% predictive concordance. For more information, see the accompanying vignette entitled ``Bayesian Inference''. In this case, roughly \Sexpr{round(summary(Pred)$Concordance*100,0)}\% of the time, \code{y} is within the 95\% probability interval of \code{yhat}. These results suggest that the model should be attempted again under different conditions, such as using different predictors, or specifying a different form to the model.
\item \code{Discrepancy.Statistic} is a summary of a specified discrepancy measure. There are many options for discrepancy measures that may be specified in the \code{Discrep} argument. In this example, the specified discrepancy measure was the $\chi^2$ test in \citet[p. 175]{gelman04}, and higher values indicate a worse fit.
\item \code{L-criterion} is a posterior predictive check for model and variable selection that measures the distance between $\textbf{y}$ and $\textbf{y}^{rep}$, providing a criterion to be minimized \citep{laud95}.
\item The last part of the summarized output reports \code{y}, information about the distribution of \code{yhat}, and the predictive quantile (\code{PQ}). The mean prediction of \code{y[1]}, or $\textbf{y}^{rep}_1$, given the model and data, is \Sexpr{round(summary(Pred)$Summary[1,2],3)}. Most importantly, \code{PQ[1]} is \Sexpr{round(summary(Pred)$Summary[1,7],3)}, indicating that \Sexpr{round(summary(Pred)$Summary[1,7]*100,1)}\% of the time, \code{yhat[1,]} was greater than \code{y[1]}, or that \code{y[1]} is close to the mean of \code{yhat[1,]}. Contrast this with the 6th record, where \code{y[6]}=\Sexpr{round(summary(Pred)$Summary[6,1],3)} and \code{PQ[6]}=\Sexpr{round(summary(Pred)$Summary[6,7],3)}. Therefore, \code{yhat[6,]} was not a good replication of \code{y[6]}, because the distribution of \code{yhat[6,]} is almost always greater than \code{y[6]}. While \code{y[1]} is within the 95\% probability interval of \code{yhat[1,]}, \code{yhat[6,]} is above \code{y[6]} \Sexpr{round(summary(Pred)$Summary[6,7]*100,1)}\% of the time, indicating a strong discrepancy between the model and data, in this case.
\end{itemize}

There are also a variety of plots for posterior predictive checks, and the type of plot is controlled with the \code{Style} argument. Many styles exist, such as producing plots of covariates and residuals. The last component of this summary may be viewed graphically as posterior densities. Rather than observing plots for each of \Sexpr{NROW(Pred$yhat)} records or rows, only the first 9 densities will be shown here:

\begin{Scode}{eval=false}
plot(Pred, Style="Density", Rows=1:9)
\end{Scode}

\begin{figure}
\begin{center}
\begin{Scode}{label=fig6,fig=TRUE,echo=FALSE,width=6,height=6}
par(mfrow=c(3,3))
for (j in 1:9){
     plot(density(Pred$yhat[j,]), xlab="Value",
          main=paste("Post. Pred. Plot of yhat[", j,
               ",]", sep=""), sub="Black=Density, Red=y")
     polygon(density(Pred$yhat[j,]), col="black", border="black")
     abline(v=Pred$y[j], col="red")
     }
\end{Scode}
\end{center}
\caption{Posterior Predictive Densities}
\end{figure}

Among many other options, the fit may be observed:

\begin{Scode}{eval=false}
plot(Pred, Style="Fitted")
\end{Scode}

\begin{figure}
\begin{center}
\begin{Scode}{label=fig7,fig=TRUE,echo=FALSE,width=6,height=6}
par(mfrow=c(1,1))
temp <- summary(Pred, Quiet=TRUE)$Summary
plot(temp[,1], temp[,5], pch=16, cex=0.75,
     ylim=c(min(temp[,c(1,4:6)], na.rm=TRUE),
          max(temp[,c(1,4:6)], na.rm=TRUE)),
     xlab="y", ylab="yhat", main="Fitted")
for (i in 1:length(y)) {
     lines(c(temp[i,1], temp[i,1]),
          c(temp[i,4], temp[i,6]))}
panel.smooth(temp[,1], temp[,5], pch=16, cex=0.75)
\end{Scode}
\end{center}
\caption{Posterior Predictive Fit}
\end{figure}

This plot shows a poor fit between the dependent variable and its expectation, and model revision should be considered.

The \code{Importance} function is not presented here in detail, but may be a useful way to assess variable importance, which is defined here as the impact of each variable on $\textbf{y}^{rep}$, when the variable is removed (or set to zero). Variable importance consists of differences in model fit or discrepancy statistics, showing how well the model fits the data with each variable removed. This information may be used for model revision, or presenting the relative importance of variables.

These posterior predictive checks indicate that there is plenty of room to improve this model.

\section{General Suggestions} \label{generalsuggestions}

Following are general suggestions on how best to use the \pkg{LaplacesDemon} package:

\begin{itemize}
\item As suggested by \citet{gelman08}, continuous predictors should be centered and scaled. Here is an explicit example in \proglang{R} of how to center and scale a single predictor called \code{x}: \code{x.cs <- (x - mean(x)) / (2*sd(x))}. However, it is instead easier to use the \code{CenterScale} function provided in \pkg{LaplacesDemon}.
\item Do not forget to reparameterize any bounded parameters in the \code{Model} function to be real-valued in the \code{parm} vector, and this is a good time to check for prior propriety with the \code{is.proper} function.
\item If sufficient sample size is available, begin with a deterministic numerical approximation algorithm such as Laplace Approximation or variational Bayes.
\item MCMC and PMC are stochastic methods of numerical approximation, and as such, results may differ with each run due to the use of pseudo-random number generation. It is good practice to set a seed so that each update of the model may be reproduced. Here is an example in \proglang{R}: \code{set.seed(666)}.
\item Rather than specify the final, intended model in the \code{Model} function, start by specifying the simplest possible form. Rather than beginning with actual data, start by simulating data given specified parameters. Update the simple model on simulated data and verify that the algorithm converges to the correct target distributions. One by one, add components to the model specification, simulate more complicated data, update, verify, and progress toward the intended model. If using MCMC during this phase, then use the \code{Juxtapose} function to compare the inefficiency of several MCMC algorithms (via integrated autocorrelation time or \code{IAT}), and use this information to select the least inefficient algorithm for your particular model. When confident the model is specified correctly and with informed algorithmic selection, finally use actual data, but with few iterations, such as \code{Iterations=20}.
\item After studying MCMC updates with few iterations, the first ``actual'' update should be long enough that proposals are accepted (the acceptance rate is not zero), adaptation begins to occur (if used), and that enough iterations occur after the first adaptation to allow the user to study the adaptation (assuming an adaptive algorithm is used).
\item Depending on the model specification function, data, and intended iterations, it is a good idea to use the \code{LaplacesDemon.RAM} function to estimate the amount of random-access memory (RAM) that \code{LaplacesDemon} will use. If \code{LaplacesDemon} uses more RAM than the computer has available, then the computer will crash. This can be used to estimate the maximum number of iterations or thinned samples for a particular model and data set on a given computer.
\item Once the final, intended model has begun (finally!), the mixing of the chains should be observed after a larger trial run, say, arbitrarily, for 10,000 iterations. If the chains do not mix as expected, then try a different algorithm, either one suggested by the \code{Consort} function (such as when diminishing adaptation is violated), or use the next least inefficient algorithm as indicated previously in the \code{Juxtapose} function.
\item When speed is a concern, such as with complex models, there may be things in the \code{Model} function that can be commented out, such as sometimes calculating \code{yhat}. The model can be updated without some features, that can be un-commented and used for posterior predictive checks. By commenting out things that are strictly unnecessary to updating, the model will update more quickly. Other helpful hints for speed are found in the documentation for the \code{Model.Spec.Time} function.
\item If the numerical approximation algorithm is exploring areas of the state space that the user knows \textit{a priori} should not be explored, then the parameters may be constrained in the \code{Model} function before being passed back to the numerical approximation function. Simply change the parameter of interest as appropriate and place the constrained value back in the \code{parm} vector.
\item For MCMC, \code{Demonic Suggestion} is intended as an aid, not an infallible replacement for critical thinking. As with anything else, its suggestions are based on assumptions, and it is the responsibility of the user to check those assumptions. For example, the \code{BMK.Diagnostic} may indicate stationarity (lack of a trend) when it does not exist. Or, the \code{Demonic Suggestion} may indicate that the next update may need to run for a million iterations in a complex model, requiring weeks to complete.
\item If an adaptive MCMC algorithm is used, then use a two-phase approach, where the first phase consists of using an adaptive algorithm to achieve stationary samples that seem to have converged to the target distributions (convergence can never be determined with MCMC, but some instances of non-convergence can be observed). Once it is believed that convergence has occurred, use a non-adaptive algorithm. The final samples should again be checked for signs of non-convergence. If satisfactory, then the non-adaptive algorithm should have estimated the logarithm of the marginal likelihood (LML). This is most easily checked with the \code{is.proper} function, which considers the joint posterior distribution to be proper if it can verify that the LML is finite.
\item The desirable number of final, thinned samples for inference depends on the required precision of the inferential goal. A good, general goal is to end up with 1,000 thinned samples \citep[p. 295]{gelman04}, where the ESS is at least 100 (and more is desirable). See the \code{ESS} function for more information.
\item Disagreement exists in MCMC literature as to whether to update one, long chain \citep{geyer92, geyer11}, or multiple, long chains with different, randomized initial values \citep{gelman92}. Multiple chains are enabled with an extension function called \code{LaplacesDemon.hpc}, which uses parallel processing. The \code{Gelman.Diagnostic} function may be used to compare multiple chains. Samples from multiple chains may be put together with the \code{Combine} function.
\item After a deterministic numerical approximation algorithm has converged, consider following it up with a stochastic numerical approximation algorithm such as MCMC, if practical. When MCMC seems to have converged, consider updating the model again, this time with Population Monte Carlo (PMC). PMC may improve the model fit obtained with MCMC, and should reduce the variance of the marginal posterior distributions, which is desirable for predictive modeling.
\item After a model has been updated, consider posterior predictive checks and any necessary model revisions. Afterward, consider updating a model with different prior distributions and compare results with the \code{BayesFactor} and \code{SensitivityAnalysis} functions, as well as comparing posterior predictive checks. Consider applying the model to different data sets and using the \code{Validate} function. Consider beyond the model how decision theory applies to the problem. Finally, make inferences, given the model and data.
\end{itemize}

\section{Independence and Observability} \label{independence}

The \pkg{LaplacesDemon} package was designed with independence and observability in mind. By independence, it is meant that a goal was to minimize dependence on other software. The \pkg{LaplacesDemon} package requires only base \proglang{R}, and the \pkg{parallel} package bundled with it. The variety of packages makes \proglang{R} extremely attractive. However, depending on multiple packages can be problematic when different packages have functions with the same name, or when a change is made in one package, but other packages do not keep pace, and the user is dependent on packages being in sync. By avoiding dependencies on packages that are not in or accompanying base \proglang{R}, the \pkg{LaplacesDemon} package is attempting to be consistent and dependable for the user.

For example, common MCMC diagnostics and probability distributions (such as Dirichlet, multivariate normal, Wishart, and many others, as well as truncated forms of distributions) in Bayesian inference have been included in the \pkg{LaplacesDemon} package so the user does not have to load numerous \proglang{R} packages, except of course for exotic distributions that have not been included.

\pkg{LaplacesDemonCpp} is an optional extension package that uses \proglang{C++}, and is not an independent package in the sense that it imports \pkg{parallel}, but also \pkg{Rcpp} and \pkg{RcppArmadillo}. Once obtained and activated, its use is seamless to a \pkg{LaplacesDemon} user. \pkg{LaplacesDemonCpp} is a stand-alone replacement of \pkg{LaplacesDemon}, and is currently in development.

By observability, it is meant that the base \pkg{LaplacesDemon} package is written entirely in \proglang{R}. Certain functions could be sped up in another language such as \proglang{C++}, but this may prevent some \proglang{R} users from understanding the code. The base \pkg{LaplacesDemon} package is intended to be open and accessible. The optional \pkg{LaplacesDemonCpp} package is available for faster computations via \proglang{C++}. If a user desires speed and is familiar with a faster language, then the user is encouraged to program the model specification function in the faster language. See the documentation for the \code{Model.Spec.Time} function for more information.

Observability also enables users to investigate or customize functions in the \pkg{LaplacesDemon} package. To access any function, simply enter the function name and press enter. For example, to print the source code for the \code{LaplacesDemon} function to the \proglang{R} console, simply enter:

\begin{Scode}{eval=false}
LaplacesDemon
\end{Scode}

Most undocumented, internal-only functions are exported in the namespace and have an alias in the LaplacesDemon-package.Rd file.

\pkg{LaplacesDemon} seeks to provide a complete, Bayesian environment within \proglang{R}. Independence from other software facilitates dependability, and its open code makes it easier for a user to investigate and customize.

\section{High Performance Computing} \label{hpc}

High performance computing (HPC) is a broad term that can mean many different things. The \pkg{LaplacesDemon} package currently uses the term HPC to refer to two topics: big data and parallel processing.

\subsection{Big Data}

There are several definitions for big data. Here, big data is defined as data that is too big for the computer memory (RAM). The \code{BigData} function enables updating a Bayesian model with big data by reading in and processing smaller batches or chunks of data and performing a user-specified function on the batch before combining and outputing the result, so the entire data set does not consume RAM. \code{BigData} is also parallelized. The \code{read.matrix} function allows sampling from big data. Finally, the Stochastic Gradient Descent (SGD) algorithm (see \ref{sgd}) in \code{LaplaceApproximation} and the Stochastic Gradient Langevin Dynamics (SGLD) algorithm in \code{LaplacesDemon} are designed specifically for use with big data.

\subsection{Parallel Processing}

Parallel processing occurs when software is designed to simultaneously use multiple central processing units (CPUs). The motherboard of a computer may contain multiple CPUs, such as a quad-core contains four, and this is called a multicore computer. Several computers may be linked together with network communication, forming what is called a computer cluster. The \pkg{LaplacesDemon} package has several functions that optionally take advantage of multicore computers or may utilize large computer clusters.

In the context of MCMC, there are three approaches to parallelization that are avilable in \pkg{LaplacesDemon}: parallel approximation within a chain, parallel sets of independent chains, and parallel sets of interactive chains. There are more parallelized functions in \pkg{LaplacesDemon} in addition to MCMC.

\subsection{Iterative Quadrature} \label{paraquad}

The \code{IterativeQuadrature} function provides several numerical integration algorithms, and each may be parallelized. At each iteration, the conditional density is evaluated at several nodes, and this processing may take advantage of multiple CPUs. For more information, see \ref{iterativequadrature}.

\subsection{Parallel Approximation within a Chain} \label{parapprox}

The Griddy-Gibbs (GG) sampler of \citet{ritter92}, Adaptive Griddy-Gibbs (AGG), and Multiple-Try Metropolis (MTM) of \citet{liu00} are examples of algorithms in which an approximation is made within a chain, and the approximation may be parallelized.

\subsection{Parallel Sets of Independent Chains} \label{indchains}

The \code{LaplacesDemon} function is extended with the \code{LaplacesDemon.hpc} function for the parallel processing of multiple chains on different central processing units (CPUs). This requires a minimum of two additional arguments: \code{Chains} to specify the number of parallel chains, and \code{CPUs} to specify the number of CPUs. The \code{LaplacesDemon.hpc} function allows the parallelization of most MCMC algorithms in the \code{LaplacesDemon} function.

An example of using \code{LaplacesDemon.hpc} is to simultaneously update three independent chains as an aid to checking MCMC convergence, as Gelman recommends \citep{gelman92}. Aside from aiding convergence, another benefit of parallelization is that more posterior samples are updated in the same time-frame as a non-parallel implementation. A multicore computer, such as a quad-core, will yield more posterior samples (which is valuable only if it converges, because it does not process more iterations), but a supercomputing environment or large computer cluster will yield many orders more. If multiple CPUs are available, then it only makes sense to use them...all.

It is important to note that \code{Status} messages do not print to the console during parallel processing with \code{LaplacesDemon.hpc}, and should alternately be directed by the user to a log file with the \code{LogFile} argument, if desired. The \code{LaplacesDemon.hpc} function sends the information associated with each chain as well as the \code{LaplacesDemon} function to each CPU. The \code{LaplacesDemon} function may very well return status messages, but the \code{LaplacesDemon.hpc} function is unaware.

After updating a model with \code{LaplacesDemon.hpc}, the \code{plot} function may be applied so that multiple chains may be viewed simultaneously, and this is helpful when comparing samplers for a specific model. If this looks good, then the \code{Gelman.Diagnostic} function may be applied to assess convergence. Otherwise, the \code{as.initial.values} function may be used to extract the latest values from the chains and use these to begin the next update. Once results seem acceptable, the \code{Combine} function may be used to combine the posterior samples of multiple chains into one \code{demonoid} object, from which the remaining facilities of the \pkg{LaplacesDemon} package are available.

The Metropolis-Coupled Markov Chain Monte Carlo (MCMCMC) algorithm of \citet{geyer91} is an example of an MCMC algorithm in which multiple chains are updated in parallel, but in \code{LaplacesDemon}, not \code{LaplacesDemon.hpc}.

\subsection{Parallel Sets of Interactive Chains} \label{intchains}

Parallel sets of independent chains should each run as efficiently as a traditional single set of chains. However, independent chains cannot benefit from the fact that there are other chains, while each chain is running. They are independent of each other.

In contrast, parallel sets of interactive chains are able to learn from each other through interaction. In the \pkg{LaplacesDemon} package, some of these algorithms are called with the \code{LaplacesDemon} function, and some with the \code{LaplacesDemon.hpc} function.

The Interchain Adaptation (INCA) algorithm \citep{craiu09, solonen12} performs Adaptive Metropolis (AM) with parallel chains that share the adaptive component, and this sharing speeds convergence. Whenever the chains are specified to adapt, adaptation is performed by pooling the historical covariance matrix across all parallel chains, and then returns the combined result to all chains. Network communication time slows the adaptation, but once returned to each CPU, chains iterate at their usual speed. This algorithm must be used with the \code{LaplacesDemon.hpc} function, and there is not an un-parallelized form of it.

The Affine-Invariant Ensemble Sampler (AIES) of \citep{goodman10} must be used with the \code{LaplacesDemon} function, and is available in either a parallelized or un-parallelized form. A large, even number of parallel chains (or walkers) are grouped into two batches, and each iteration, each chain moves in relation to a randomly selected chain (walker) in the other batch. Since these interactive chains interact each iteration, computer network communication is frequent, and this communication may be much slower than processing with one CPU. However, in a large-scale computing environment and when a \code{Model} function is not trivial to evaluate, this form of parallelization can result in very early convergence.

\subsection{Population Monte Carlo}

The \code{PMC} function has been parallelized at each iteration to speed up the evaluation of the model specification function over numerous importance samples.

\subsection{Predict Functions}

The predict functions (\code{predict.demonoid}, \code{predict.laplace}, \code{predict.pmc}) have been parallelized to speed up the prediction, or scoring, of larger data sets or when models have many posterior samples. The \code{Importance} function, which extensively uses predict functions, has also been parallelized.

\subsection{Model Specification Function}
A user may have a model with a model specification function that is computationally expensive, and may write their own parallelization code to speed up its processing by breaking down challenging computations and sending them to separate CPUs.

\subsection{Parallelization Details}

Parallelization is enabled by the \pkg{parallel} package that comes with base \proglang{R}. Parallelization is accomplished by default with socket-transport functions derived from the \pkg{snow} package, which is an acronym for a Simple Network of Workstations. Alternatively, Message Passing Interface (MPI) may be used. SNOW is more general, being cross-platform, and works on multicore computers, computer clusters, and supercomputers. More performance may be found with MPI, but it is more specialized. \code{LaplacesDemon.hpc} was reported to have been used successfully on a cluster with over 200 nodes.

\section{Details} \label{details}

The \pkg{LaplacesDemon} package uses five broad types of numerical approximation algorithms: Importance Sampling (IS), Iterative Quadrature, Laplace Approximation, Markov chain Monte Carlo (MCMC), and Variational Bayes (VB). Approximate Bayesian Computation (ABC) may be estimated within each. These numerical approximation algorithms are introduced below.

\subsection{Approximate Bayesian Computation} \label{abc}

Approximate Bayesian Computation (ABC), also called likelihood-free estimation, is a family of numerical approximation techniques in Bayesian inference. ABC is especially useful when evaluation of the likelihood, $p(\textbf{y} | \Theta)$ is computationally prohibitive, or when suitable likelihoods are unavailable. As such, ABC algorithms estimate likelihood-free approximations. ABC is usually faster than a similar likelihood-based numerical approximation technique, because the likelihood is not evaluated directly, but replaced with an approximation that is usually easier to calculate. The approximation of a likelihood is usually estimated with a measure of distance between the observed sample, $\textbf{y}$, and its replicate given the model, $\textbf{y}^{rep}$, or with summary statistics of the observed and replicated samples. See the accompanying vignette entitled ``Examples'' for an example.

\subsection{Importance Sampling} \label{importancesampling}

Importance Sampling (IS) is a method of estimating a distribution with samples from a different distribution, called the importance distribution. Importance weights are assigned to each sample. The main difficulty with IS is in the selection of the importance distribution. IS dates back at least to the 1950s, including iterative IS. IS is the basis of a wide variety of algorithms, some of which involve the combination of IS and Markov chain Monte Carlo (MCMC). There are also many variations of IS, including adaptive IS, and parametric and nonparametric self-normalized IS (SNIS). Some popular algorithms, or families of algorithms, that include IS are Particle Filtering, Population Monte Carlo (PMC), and Sequential Monte Carlo (SMC).

\subsubsection{Population Monte Carlo} \label{pmc}

Population Monte Carlo (PMC) uses adaptive IS, and the proposal or importance distribution is a multivariate Gaussian \citep{cappe04}, or a mixture of multivariate Gaussian distributions \citep{cappe08, wraith09}. \pkg{LaplacesDemon} uses the version presented in the appendix of \citet{wraith09}. At each iteration, the importance distribution of $N$ samples and $M$ mixture components is adapted. Parallel processing is available.

Compared with Markov chain Monte Carlo (MCMC), very few iterations are required, convergence and ergodicity are not problems, posterior samples are independent, and PMC lends itself well to parallelization. However, PMC requires much more prior information about the model (better initial values and proposal covariance matrix) than MCMC, and becomes harder to apply as the number of variables increases.

Amazingly, PMC may improve the model fit obtained with MCMC, and should reduce the variance of the marginal posterior distributions. This reduction in variance is desirable for predictive modeling. Therefore, it is recommended that a model is attempted to be updated with PMC after the model seems to have converged with MCMC.

\subsection{Iterative Quadrature} \label{iterativequadrature}

Quadrature is a historical term in mathematics that means determining area. Mathematicians of ancient Greece, according to the Pythagorean doctrine, understood determination of area of a figure as the process of geometrically constructing a square having the same area (squaring). Thus the name quadrature for this process.

In medieval Europe, quadrature meant the calculation of area by any method. With the invention of integral calculus, quadrature has been applied to the computation of a univariate definite integral. Numerical integration is a broad family of algorithms for calculating the numerical value of a definite integral. Numerical quadrature is a synonym for quadrature applied to one-dimensional integrals. Multivariate quadrature, also called cubature, is the application of quadrature to multidimensional integrals.

A quadrature rule is an approximation of the definite integral of a function, usually stated as a weighted sum of function values at specified points within the domain of integration. The specified points are referred to as abscissae, abscissas, integration points, or nodes, and have associated weights. The calculation of the nodes and weights of the quadrature rule differs by the type of quadrature. There are numerous types of quadrature algorithms. Bayesian forms of quadrature usually use Gauss-Hermite quadrature \citep{naylor82}, and placing a Gaussian Process on the function is a common extension \citep{ohagan91,rasmussen03} that is called `Bayesian Quadrature'. Often, these and other forms of quadrature are also referred to as model-based integration.

Gauss-Hermite quadrature uses Hermite polynomials to calculate the rule. However, there are two versions of Hermite polynomials, which result in different kernels in different fields. In physics, the kernel is $\exp(-x^2)$, while in probability the kernel is $\exp(-x^2/2)$. The weights are a normal density. If the parameters of the normal distribution, $\mu$ and $\sigma^2$, are estimated from data, then it is referred to as adaptive Gauss-Hermite quadrature, and the parameters are the conditional mean and conditional variance. Outside of Gauss-Hermite quadrature, adaptive quadrature implies that a difficult range in the integrand is subdivided with more points until it is well-approximated. Gauss-Hermite quadrature performs well when the integrand is smooth, and assumes normality or multivariate normality. Adaptive Gauss-Hermite quadrature has been demonstrated to outperform Gauss-Hermite quadrature in speed and accuracy.

A goal in quadrature is to minimize integration error, which is the error between the evaluations and the weights of the rule. Therefore, a goal in Bayesian Gauss-Hermite quadrature is to minimize integration error while approximating a marginal posterior distribution that is assumed to be smooth and normally-distributed. This minimization often occurs by increasing the number of nodes until a change in mean integration error is below a tolerance, rather than minimizing integration error itself, since the target may be only approximately normally distributed, or minimizing the sum of integration error, which would change with the number of nodes.

To approximate integrals in multiple dimensions, one approach applies $N$ nodes of a univariate quadrature rule to multiple dimensions (using the \code{GaussHermiteCubeRule} function for example) via the product rule, which results in many more multivariate nodes. This requires the number of function evaluations to grow exponentially as dimension increases. Multidimensional quadrature is usually limited to less than ten dimensions, both due to the number of nodes required, and because the accuracy of multidimensional quadrature algorithms decreases as the dimension increases. Three methods may overcome this curse of dimensionality in varying degrees: componentwise quadrature, sparse grids, and Monte Carlo.

Componentwise quadrature is the iterative application of univariate quadrature to each parameter. It is applicable with high-dimensional models, but sacrifices the ability to calculate the conditional covariance matrix, and calculates only the variance of each parameter.

Sparse grids were originally developed by Smolyak for multidimensional quadrature. A sparse grid is based on a one-dimensional quadrature rule. Only a subset of the nodes from the product rule is included, and the weights are appropriately rescaled. Although a sparse grid is more efficient because it reduces the number of nodes to achieve the same accuracy, the user must contend with increasing the accuracy of the grid, and it remains inapplicable to high-dimensional integrals.

Monte Carlo is a large family of sampling-based algorithms. \citet{ohagan87} asserts that Monte Carlo is frequentist, inefficient, regards irrelevant information, and disregards relevant information. Quadrature, he maintains \citep{ohagan92}, is the most Bayesian approach, and also the most efficient. In high dimensions, he concedes, a popular subset of Monte Carlo algorithms is currently the best for cheap model function evaluations. These algorithms are called Markov chain Monte Carlo (MCMC). High-dimensional models with expensive model evaluation functions, however, are not well-suited to MCMC. A large number of MCMC algorithms is available in the \code{LaplacesDemon} function.

Following are some reasons to consider iterative quadrature rather than MCMC. Once an MCMC sampler finds equilibrium, it must then draw enough samples to represent all targets. Iterative quadrature does not need to continue drawing samples. Multivariate quadrature is consistently reported as more efficient than MCMC when its assumptions hold, though multivariate quadrature is limited to small dimensions. High-dimensional models therefore default to MCMC, between the two. Componentwise quadrature algorithms like CAGH, however, may also be more efficient with clock-time than MCMC in high dimensions, especially against componentwise MCMC algorithms. Another reason to consider iterative quadrature are that assessing convergence in MCMC is a difficult topic, but not for iterative quadrature. A user of iterative quadrature does not have to contend with effective sample size and autocorrelation, assessing stationarity, acceptance rates, diminishing adaptation, etc. Stochastic sampling in MCMC is less efficient when samples occur in close proximity (such as when highly autocorrelated), whereas in quadrature the nodes are spread out by design.

In general, the conditional means and conditional variances progress smoothly to the target in multidimensional quadrature. For componentwise quadrature, movement to the target is not smooth, and often resembles a Markov chain or optimization algorithm.

Iterative quadrature is often applied after \code{LaplaceApproximation} to obtain a more reliable estimate of parameter variance or covariance than the negative inverse of the \code{Hessian} matrix of second derivatives, which is suitable only when the contours of the logarithm of the unnormalized joint posterior density are approximately ellipsoidal \citep{naylor82}.

\subsubsection{Adaptive Gauss-Hermite} \label{agh}

The Adaptive Gauss-Hermite (AGH) algorithm is the \citet{naylor82} algorithm. The AGH algorithm uses multivariate quadrature with the physicist's (not the probabilist's) kernel.

There are four algorithm specifications: \code{N} is the number of univariate nodes, \code{Nmax} is the maximum number of univariate nodes, \code{Packages} accepts any package required for the model function when parallelized, and \code{Dyn.libs} accepts dynamic libraries for parallelization, if required. The number of univariate nodes begins at $N$ and increases by one each iteration. The number of multivariate nodes grows quickly with $N$. \citet{naylor82} recommend beginning with as few nodes as $N=3$. Any of the following events will cause $N$ to increase by 1 when $N$ is less than \code{Nmax}:

\begin{itemize}
    \item All LP weights are zero (and non-finite weights are set to zero)
    \item $\mu$ does not result in an increase in LP
    \item All elements in $\Sigma$ are not finite
    \item The square root of the sum of the squared changes in $\mu$ is less than or equal to the \code{Stop.Tolerance}
\end{itemize}

Tolerance includes two metrics: change in mean integration error and change in parameters. Including the change in parameters for tolerance was not mentioned in \citet{naylor82}.

\citet{naylor82} consider a transformation due to correlation. This is not included here.

The AGH algorithm does not currently handle constrained parameters, such as with the \code{interval} function. If a parameter is constrained and changes during a model evaluation, this changes the node and the multivariate weight. This is currently not corrected.

An advantage of AGH over componentwise adaptive quadrature is that AGH estimates covariance, where a componentwise algorithm ignores it. A disadvantage of AGH over a componentwise algorithm is that the number of nodes increases so quickly with dimension, that AGH is limited to small-dimensional models.

\subsubsection{Adaptive Gauss-Hermite Sparse Grid} \label{aghsg}

The Adaptive Gauss-Hermite Sparse Grid (AGHSG) algorithm is the \citet{naylor82} algorithm applied to a sparse grid, rather than a traditional multivariate quadrature rule. This is identical to the AGH algorithm above, except that a sparse grid replaces the multivariate quadrature rule.

The sparse grid reduces the number of nodes. The cost of reducing the number of nodes is that the user must consider the accuracy, $K$.

There are four algorithm specifications: \code{K} is the accuracy (as a positive integer), \code{Kmax} is the maximum accuracy, \code{Packages} accepts any package required for the model function when parallelized, and \code{Dyn.libs} accepts dynamic libraries for parallelization, if required. These arguments represent accuracy rather than the number of univariate nodes, but otherwise are similar to the AGH algorithm.

\subsubsection{Componentwise Adaptive Gauss-Hermite} \label{cagh}

The Componentwise Adaptive Gauss-Hermite (CAGH) algorithm is a componentwise version of the adaptive Gauss-Hermite quadrature of \citet{naylor82}. Each iteration, each marginal posterior distribution is approximated sequentially, in a random order, with univariate quadrature. The conditional mean and conditional variance are also approximated each iteration, making it an adaptive algorithm.

There are four algorithm specifications: \code{N} is the number of nodes, \code{Nmax} is the maximum number of nodes, \code{Packages} accepts any package required for the model function when parallelized, and \code{Dyn.libs} accepts dynamic libraries for parallelization, if required. The number of nodes begins at $N$. All parameters have the same number of nodes. Any of the following events will cause $N$ to increase by 1 when $N$ is less than \code{Nmax}, and these conditions refer to all parameters (not individually):

\begin{itemize}
    \item Any LP weights are not finite
    \item All LP weights are zero
    \item $\mu$ does not result in an increase in LP
    \item The square root of the sum of the squared changes in $\mu$ is less than or equal to the \code{Stop.Tolerance}
\end{itemize}

It is recommended to begin with \code{N=3} and set \code{Nmax} between 10 and 100. As long as CAGH does not experience problematic weights, and as long as CAGH is improving LP with $\mu$, the number of nodes does not increase. When CAGH becomes either universally problematic or universally stable, then $N$ slowly increases until the sum of both the mean integration error and the sum of the squared changes in $\mu$ is less than the \code{Stop.Tolerance} for two consecutive iterations.

If the highest LP occurs at the lowest or highest node, then the value at that node becomes the conditional mean, rather than calculating it from all weighted samples; this facilitates movement when the current integral is poorly centered toward a well-centered integral. If all weights are zero, then a random proposal is generated with a small variance.

Tolerance includes two metrics: change in mean integration error and change in parameters, as the square root of the sum of the squared differences.

When a parameter constraint is encountered, the node and weight of the quadrature rule is recalculated.

An advantage of CAGH over multidimensional adaptive quadrature is that CAGH may be applied in large dimensions. Disadvantages of CAGH are that only variance, not covariance, is estimated, and ignoring covariance may be problematic.

\subsection{Laplace Approximation} \label{laplaceapproximation}

The Laplace Approximation or Laplace Method is a family of asymptotic techniques used to approximate integrals. Laplace's method seems to accurately approximate unimodal posterior moments and marginal posterior distributions in many cases. Since it is not applicable in all cases, it is recommended here that Laplace Approximation is used cautiously in its own right, or preferably, it is used before MCMC.

After introducing the Laplace Approximation \citep[p. 366--367]{laplace74}, a proof was published later \citep{laplace14} as part of a mathematical system of inductive reasoning based on probability. Laplace used this method to approximate posterior moments.

Since its introduction, the Laplace Approximation has been applied successfully in many disciplines. In the 1980s, the Laplace Approximation experienced renewed interest, especially in statistics, and some improvements in its implementation were introduced \citep{tierney86, tierney89}. Only since the 1980s has the Laplace Approximation been seriously considered by statisticians in practical applications.

There are many variations of Laplace Approximation, with an effort toward replacing Markov chain Monte Carlo (MCMC) algorithms as the dominant form of numerical approximation in Bayesian inference. The run-time of Laplace Approximation is a little longer than Maximum Likelihood Estimation (MLE), usually shorter than variational Bayes, and much shorter than MCMC \citep{azevedo94}.

The speed of Laplace Approximation depends on the optimization algorithm selected, and typically involves many evaluations of the objective function per iteration (where an MCMC algorithm with a multivariate proposal usually evaluates once per iteration), making many MCMC algorithms faster per iteration. The attractiveness of Laplace Approximation is that it typically improves the objective function better than iterative quadrature, MCMC, and PMC when the parameters are in low-probability regions. Laplace Approximation is also typically faster than MCMC and PMC because it is seeking point-estimates, rather than attempting to represent the target distribution with enough simulation draws. Laplace Approximation extends MLE, but shares similar limitations, such as its asymptotic nature with respect to sample size and that marginal posterior distributions are Gaussian. \citet{bernardo00} note that Laplace Approximation is an attractive family of numerical approximation algorithms, and will continue to develop.

\code{LaplaceApproximation} seeks a global maximum of the logarithm of the unnormalized joint posterior density. The approach differs by \code{Method}. The \code{LaplacesDemon} function uses the \code{LaplaceApproximation} algorithm to optimize initial values and save time for the user.

Most optimization algorithms assume that the logarithm of the unnormalized joint posterior density is defined and differentiable\footnote{When the joint posterior is not differentiable, and should be, it has probably encountered an area of flat density. It is recommended that WIPs are used for regularization. For more information on WIPs, see the accompanying vignette entitled ``Bayesian Inference''.}. Some methods calculate an approximate gradient for each initial value as the difference in the logarithm of the unnormalized joint posterior density due to a slight increase in the parameter.

The user may select from numerous optimization algorithms:

\subsubsection{Adaptive Gradient Ascent} \label{aga}

With adaptive gradient ascent, the direction and distance for each parameter is proposed based on an approximate truncated graident and an adaptive step size. The step size parameter, which is often plural and called rate parameters in other literature, is adapted each iteration with the univariate version of the Robbins-Monro stochastic approximation in \citet{garthwaite10}. The step size shrinks when a proposal is rejected and expands when a proposal is accepted.

Gradient ascent is criticized for sometimes being relatively slow when close to the maximum, and its asymptotic rate of convergence is inferior to other methods. However, compared to other popular optimization algorithms such as Newton-Raphson, an advantage of the gradient ascent is that it works in infinite dimensions, requiring only sufficient computer memory. Although Newton-Raphson converges in fewer iterations, calculating the inverse of the negative Hessian matrix of second-derivatives is more computationally expensive and subject to singularities. Therefore, gradient ascent takes longer to converge, but is more generalizable.

\subsubsection{BFGS} \label{bfgs}

The Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm was proposed independently by \citet{broyden70}, \citet{fletcher70}, \citet{goldfarb70}, and \citet{shanno70}. BFGS may be the most efficient and popular quasi-Newton optimiziation algorithm. As a quasi-Newton algorithm, the Hessian matrix is approximated using rank-one updates specified by (approximate) gradient evaluations. Since BFGS is very popular, there are many variations of it. This is a version by Nash that has been adapted from the Rvmmin package, and is used in the \code{optim} function of base R. The approximate Hessian is not guaranteed to converge to the Hessian. When BFGS is used, the approximate Hessian is not used to calculate the final covariance matrix.

\subsubsection{BHHH} \label{bhhh}

The BHHH algorithm of \citet{berndt74} is a quasi-Newton method that includes a step-size parameter, partial derivatives, and an approximation of a covariance matrix that is calculated as the inverse of the sum of the outer product of the gradient (OPG), calculated from each record. The OPG method becomes more costly with data sets with more records. Since partial derivatives must be calculated per record of data, the list of data has special requirements with this method, and must include design matrix \textbf{X}, and dependent variable \textbf{y} or \textbf{Y}. Records must be row-wise. An advantage of BHHH over NR (see below) is that the covariance matrix is necessarily positive definite, and gauranteed to provide an increase in LP each iteration (given a small enough step-size), even in convex areas. The covariance matrix is better approximated with larger data sample sizes, and when closer to the maximum of LP. Disadvantages of BHHH include that it can give small increases in LP, especially when far from the maximum or when LP is highly non-quadratic.

\subsubsection{Conjugate Gradient} \label{cg}

Conjugate gradient (CG) is a family of algorithms that uses partial derivatives, but does not use the Hessian matrix or any approximation of it. CG usually requires more iterations to reach convergence than other algorithms that use the Hessian or an approximation. However, since the Hessian becomes computationally expensive as the dimension of the model grows, CG is applicable to large dimensional models. CG was originally developed by \citet{hestenes52}. The version here is a nonlinear CG method.

\subsubsection{Davidon-Fletcher-Powell} \label{dfp}

The Davidon-Fletcher-Powell (DFP) algorithm was the first popular, multidimensional, quasi-Newton optimization algorithm. The DFP update of an approximate Hessian matrix maintains symmetry and positive-definiteness. The approximate Hessian is not guaranteed to converge to the Hessian. When DFP is used, the approximate Hessian is not used to calculate the final covariance matrix. Although DFP is very effective, it was superseded by the BFGS algorithm.

\subsubsection{Hit-And-Run} \label{har}

This version of the Hit-And-Run (HAR) algorithm makes multivariate proposals and uses an adpative length parameter. The length parameter is adapted each iteration with the univariate version of the Robbins-Monro stochastic approximation in \citet{garthwaite10}. The length shrinks when a proposal is rejected and expands when a proposal is accepted. This is the same algorithm as the HARM or Hit-And-Run Metropolis MCMC algorithm with adaptive length, except that a Metropolis step is not used.

\subsubsection{Hooke-Jeeves} \label{hj}

The Hooke-Jeeves algorithm \citep{hooke61} is a derivative-free, direct search method. Each iteration involves two steps: an exploratory move and a pattern move. The exploratory move explores local behavior, and the pattern move takes advantage of pattern direction. It is sometimes described as a hill-climbing algorithm. If the solution improves, it accepts the move, and otherwise rejects it. Step size decreases with each iteration. The decreasing step size can trap it in local maxima, where it gets stuck and convergences erroneously. Users are encouraged to attempt again after what seems to be convergence, starting from the latest point. Although getting stuck at local maxima can be problematic, the Hooke-Jeeves algorithm is also attractive because it is simple, fast, does not depend on derivatives, and is otherwise relatively robust.

\subsubsection{Levenberg-Marquardt} \label{levenberg}

Also known as the Levenberg-Marquardt Algorithm (LMA) or the Damped Least-Squares (DLS) method, Levenberg-Marquardt (LM) is a trust region (not to be confused with TR below) optimization algorithm that minimizes nonlinear least squares, and has been adapted here to maximize LP. LM uses partial derivatives and approximates the Hessian with outer-products. It is suitable for nonlinear optimization up to a few hundred parameters, but loses its efficiency in larger problems due to matrix inversion. LM is considered between the Gauss-Newton algorithm and gradient descent. When far from the solution, LM moves slowly like gradient descent, but is guaranteed to converge. When LM is close to the solution, LM becomes a damped Gauss-Newton method.

\subsubsection{Limited-Memory BFGS} \label{lmbfgs}

The limited-memory BFGS (Broyden-Fletcher-Goldfarb-Shanno) algorithm is a quasi-Newton optimization algorithm that compactly approximates the Hessian matrix. Rather than storing the dense Hessian matrix, L-BFGS stores only a few vectors that represent the approximation. This algorithm is better suited for large-scale models than the BFGS algorithm. When (\code{method="LBFGS"}) for \code{LaplaceApproximation}, \code{method="L-BFGS-B"} is called in the \code{optim} function of base \proglang{R}.

\subsubsection{Nelder-Mead} \label{nm}

The Nelder-Mead algorithm \citep{nelder65} is a derivative-free, direct search method that is known to become inefficient in large-dimensional problems. As the dimension increases, the search direction becomes increasingly orthogonal to the steepest ascent (usually descent) direction. However, in smaller dimensions it is a popular algorithm. At each iteration, three steps are taken to improve a simplex: reflection, extension, and contraction.

\subsubsection{Newton-Raphson} \label{nr}

The Newton-Raphson optimization algorithm, also known as Newton's Method, uses derivatives and a Hessian matrix. The algorithm is included for its historical significance, but is known to be problematic when starting values are far from the targets, and calculating and inverting the Hessian matrix can be computationally expensive. As programmed here, when the Hessian is problematic, it tries to use only the derivatives, and when that fails, a jitter is applied. Newton-Raphson should not be the first choice of the user, and BFGS should always be preferred.

\subsubsection{Particle Swarm Optimization} \label{pso}

Of numerous Particle Swarm Optimization (PSO) algorithms, the Standard Particle Swarm Optimization 2007 (SPSO 07) algorithm is used here. A swarm of particles is moved according to velocity, neighborhood, and the best previous solution. The neighborhood for each particle is a set of informing particles. PSO is derivative-free.

\subsubsection{Resilient Backpropagation} \label{resilientbackprop}

``Rprop'' stands for resilient backpropagation. In Rprop, the approximate gradient is taken for each parameter in each iteration, and its sign is compared to the approximate gradient in the previous iteration. A weight element in a weight vector is associated with each approximate gradient. A weight element is multiplied by 1.2 when the sign does not change, or by 0.5 if the sign changes. The weight vector is the step size, and is constrained to the interval [0.001, 50], and initial weights are 0.0125. This is the resilient backpropagation algorithm, which is often denoted as the ``Rprop-'' algorithm of \citet{riedmiller94}.

\subsubsection{Self-Organizing Migration Algorithm} \label{soma}

The Self-Organizing Migration Algorithm (SOMA) of \citet{zelinka04}, as used here, moves a population of ten particles or individuals in the direction of the best particle, the leader. The leader does not move in each iteration, and a line-search is used for each non-leader, up to three times the difference in parameter values between each non-leader and leader. This algorithm is derivative-free and often considered in the family of evolution algorithms. Numerous model evaluations are performed per non-leader per iteration.

\subsubsection{Spectral Projected Gradient} \label{spg}

The Spectral Projected Gradient (SPG) algorithm is a non-monotone algorithm that is suitable for high-dimensional models. The approximate gradient is used, but the Hessian matrix is not. SPG is the default algorithm for the \code{LaplaceApproximation} function.

\subsubsection{Stochastic Gradient Descent} \label{sgd}

The stochastic gradient descent (SGD) algorithm, here, is designed only for big data. Traditional optimization algorithms require the entire data set to be included in the model evaluation each iteration. In contrast, SGD reads and processes only a small, randomly selected batch of records each iteration. In addition to saving computation time, the entire data set does not need to be loaded into memory at once.

In this version of SGD, a multivariate proposal is used, and it is merely the vector of current values plus a step size times the gradient.

SGD requires five objects in the \code{Data} list: \code{epsilon} or $\epsilon$ is the step size as a scalar, \code{file} is a quoted name of a .csv file that is the big data set, \code{Nr} is the number of rows in the big data set, \code{Nc} is the number of columns in the big data set, and \code{size} is the number of rows to be read and processed each iteration.

Since SGD, as implemented here, is designed for big data, the entire data set is not included in the \code{Data} list, but one small batch must be included and named \code{X}. All data must be included. For example, both the dependent variable \textbf{y} and design matrix \textbf{X} in linear regression are included. The requirement for the small batch to be in \code{Data} is so that numerous checks may be passed after \code{LaplaceApproximation} is called and before the SGD algorithm begins. Each iteration, SGD uses the \code{scan} function, without headers, to read a random block of rows from, say, \code{X.csv}, stores it in \code{Data$X}, and passes it to the \code{Model} specification function. The \code{Model} function must differ from the other examples found in this package in that multiple objects, such as \code{X} and \code{y} must be read from \code{Data$X}, where usually there is both \code{Data$X} and \code{Data$y}.

The user tunes SGD with step size $\epsilon$ via \code{Data$epsilon}. The step size must be scalar and remain in the interval (0,1). When $\epsilon = 0$, SGD is reduced to zero, the algorithm will not move, and false convergence occurs. When $\epsilon$ is too large, degenerate results occur. A good recommendation seems to be to begin with $\epsilon$ set to \code{1/Nr}. The user may perform several short runs, and experimenting with adjusting \code{Data$epsilon}. At least \code{Nr / size} iterations are suggested.

\subsubsection{Symmetric Rank-One} \label{srone}

The Symmetric Rank-One (SR1) algorithm is a quasi-Newton optimization algorithm, and the Hessian matrix is approximated, often without being positive-definite. At the posterior modes, the true Hessian is usually positive-definite, but this is often not the case during optimization when the parameters have not yet reached the posterior modes. Other restrictions, including constraints, often result in the true Hessian being indefinite at the solution. For these reasons, SR1 often outperforms BFGS. The approximate Hessian is not guaranteed to converge to the Hessian. When SR1 is used, the approximate Hessian is not used to calculate the final covariance matrix.

\subsubsection{Trust Region} \label{trust}

The Trust Region (TR) algorithm of \citet{nocedal99} attempts to reach its objective in the fewest number of iterations, is therefore very efficient, as well as safe. The efficiency of TR is attractive when model evaluations are expensive.The Hessian is approximated each iteration, making TR best suited to models with small to medium dimensions, say up to a few hundred parameters.

\subsubsection{Afterward} \label{afterward1}

After \code{LaplaceApproximation} finishes, due either to early convergence or completing the number of specified iterations, it approximates the Hessian matrix of second derivatives (by default, but the user has other options), and attempts to calculate the covariance matrix by taking the inverse of the negative of this matrix. If successful, then this covariance matrix may be passed to \code{IterativeQuadrature}, \code{LaplacesDemon}, or \code{PMC}, and the diagonal of this matrix is the variance of the parameters. If unsuccessful, then a scaled identity matrix is returned, and each parameter's variance will be 1.

\subsection{Markov Chain Monte Carlo} \label{mcmc}

Markov chain Monte Carlo (MCMC) algorithms are also called samplers. There are a large number of MCMC algorithms, too many to review here. Popular families (which are often non-distinct) include Gibbs sampling, Metropolis-Hastings, slice sampling, Hamiltonian Monte Carlo, and many others. Though the name is misleading, Metropolis-within-Gibbs (MWG) was developed first \citep{metropolis53}, and Metropolis-Hastings was a generalization of MWG \citep{hastings70}. All MCMC algorithms are known as special cases of the Metropolis-Hastings algorithm. Regardless of the algorithm, the goal in Bayesian inference is to maximize the unnormalized joint posterior distribution and collect samples of the target distributions, which are marginal posterior distributions, later to be used for inference.

The most generalizable MCMC algorithm is the Metropolis-Hastings (MH) generalization of the MWG algorithm. The MH algorithm extended MWG to include asymmetric proposal distributions. For years, the main disadvantage of the MWG algorithms was that the proposal variance (see below) had to be tuned manually, and therefore other MCMC algorithms have become popular because they do not need to be tuned.

Gibbs sampling became popular for Bayesian inference, though it requires conditional sampling of conjugate distributions, so it is precluded from non-conjugate sampling in its purest form. Gibbs sampling also suffers under high correlations \citep{gilks96}. Due to these limitations, Gibbs sampling is less generalizable than RWM, though RWM and other algorithms are not immune to problems with correlation. The Griddy-Gibbs sampler evaluates a grid of proposals and approximates the conditional distribution, which enables non-conjugate sampling. Componentwise slice sampling is a special case of Gibbs sampling that samples a distribution by sampling uniformly from the region under the plot of its density function, and is more appropriate with bounded distributions that cannot approach infinity, though the improved slice sampler of \citet{neal03} is available here.

\subsubsection{Blockwise Sampling} \label{block}

Usually, there is more than one target distribution, in which case it must be determined whether it is best to sample from target distributions individually, in groups, or all at once. Block updating refers to splitting a multivariate vector into groups called blocks, and each block is sampled separately. A block may contain one or more parameters.

Parameters are usually grouped into blocks such that parameters within a block are as correlated as possible, and parameters between blocks are as independent as possible. This strategy retains as much of the parameter correlation as possible for blockwise sampling, as opposed to componentwise sampling where parameter correlation is ignored. The \code{PosteriorChecks} function can be used on the output of previous runs to find highly correlated parameters, and the \code{Blocks} function may be used to create blocks based on posterior correlation.

Advantages of blockwise sampling are that a different MCMC algorithm may be used for each block (or parameter, for that matter), creating a more specialized approach (though different algorithms by block are not supported here), the acceptance of a newly proposed state is likely to be higher than sampling from all target distributions at once in high dimensions, and large proposal covariance matrices can be reduced in size, which is most helpful again in high dimensions.

Disadvantages of blockwise sampling are that correlations probably exist between parameters between blocks, and each block is updated while holding the other blocks constant, ignoring these correlations of parameters between blocks. Without simultaneously taking everything into account, the algorithm may converge slowly or never arrive at the proper solution. However, there are instances when it may be best when everything is not taken into account at once, such as in state-space models. Also, as the number of blocks increases, more computation is required, which slows the algorithm. In general, blockwise sampling allows a more specialized approach at the expense of accuracy, generalization, and speed. Blockwise sampling is offered in the following algorithms: Adaptive Metropolis-within-Gibbs (AMWG), Adaptive-Mixture Metropolis (AMM), Automated Factor Slice Sampler (AFSS), Elliptical Slice Sampler (ESS), Hit-And-Run Metropolis (HARM), Metropolis-within-Gibbs (MWG), Random-Walk Metropolis (RWM), Robust Adaptive Metropolis (RAM), and the Univariate Eigenvector Slice Sampler (UESS).

\subsubsection{Markov Chain Properties} \label{markovchainproperties}

This tutorial introduces only briefly the basics of Markov chain properties. A Markov chain is Markovian when the current iteration depends only on the previous iteration. Many (but not all) adaptive algorithms are merely chains but not Markov chains when the adaptation is based on the history of the chains, not just the previous iteration. A Markov chain is said to be aperiodic when it is not repeating a cycle. A Markov chain is considered irreducible when it is possible to go from any state to any other state, though not necessarily in one iteration. A Markov chain is said to be recurrent if it will eventually return to a given state with probability 1, and it is positive recurrent if the expected return time is finite, and null recurrent otherwise. The ergodic theorem states that a Markov chain is ergodic when it is aperiodic, irreducible, and positive recurrent.

The non-Markovian chains of an adaptive algorithm that adapt based on the history of the chains should have two conditions: containment and diminishing adaptation. Containment is difficult to implement and is not currently programmed into \pkg{LaplacesDemon}. The condition of diminishing adaptation is fulfilled when the amount of adaptation diminishes with the length of the chain. Diminishing adaptation can be achieved when the proposal variances become smaller or by decreasing the probability of performing adaptations with more iterations \citep{roberts07}. Trace-plots of the output of the \code{LaplacesDemon} function automatically include plots of the absolute differences in proposal variance with each adaptation for adaptive algorithms, and the \code{Consort} function will try to suggest a different adaptive algorithm when these absolute differences are not trending downward.

Descriptions of the MCMC algorithms in the \pkg{LaplacesDemon} package are available online at \url{https://web.archive.org/web/20150227012508/http://www.bayesian-inference.com/mcmc}.

\subsubsection{Sampler Selection} \label{samplerselection}

The optimal sampler differs for each problem, and it is recommended that the \code{Juxtapose} function is used to help select the least inefficient MCMC algorithm. Nonetheless, some general observations here may be helpful to a user attempting to select the most appropriate sampler for a given model. Suggestions in this section have been reached by attempting to compare all samplers on most models in the accompanying ``Examples'' vignette. Comparisons consisted of

\begin{itemize}
\item diminishing adaptation, if applicable
\item how many iterations it took the sampler to seem to converge
\item how many minutes it took the sampler to seem to converge
\item how quickly the sampler improved in the beginning
\item \code{Juxtapose} results based on integrated autocorrelation time (\code{IAT})
\item mixing of the chains
\item whether or not the sampler arrived at the correct solution
\end{itemize}

When the user is ready to select a general-purpose sampler, the best place to begin is with the AFSS algorithm. This is not to say that AFSS is the best sampler and everything else pales by comparison. Instead, AFSS is a great sampler with which to start in the general case, and for beginners. Although AFSS has several algorithm specifications, the default specifications are suitable for many cases.

A new user should not begin to learn AFSS and this package with a complicated and high-dimensional model. When this is necessary, the user of AFSS will need to learn how to create blocks of parameters and a list of proposal covariance matrices. In smaller cases, more suitable to learning, the user should not generally have to adjust the \code{m} or \code{w} specifications, and need only learn \code{A} and \code{n}. A new user should begin with \code{A=Inf} and \code{n=0}, and use code provided by the \code{Consort} function for the next run. When the user is satified that equilibrium is reached, then another run should be made without adaptation: \code{A=0}.

Models with multimodal marginal posterior distributions are potentially troublesome for any numerical approximation algorithm, though MCMC may be better suited in general. It is best to begin either with MCMCMC or RDMH. Alternatives include AFSS, AGG, CHARM, GG, HARM, RAM, Slice, THMC, or t-walk. The MCMCMC and RDMH algorithms have demonstrated remarkable performance with multimodal distributions. The use of parallel chains in MCMCMC increases the chances that different chains may settle on different modes. Parallel chains from other parallelized algorithms may be helpful in finding multiple modes, but when the chains are combined with the \code{Combine} function for inference, each mode probably is not represented in a proportion correct for the distribution. Consider updating the model with PMC, with multiple mixture components, after MCMC is finished. Unlike MCMC with parallel chains, the proportion of each mode will be correctly represented with PMC.

Models with discrete parameters currently require either the AGG, GG, or Slice algorithms, or converting the discrete parameters to continuous parameters so that any MCMC algorithm may be used. This is performed via the continuous relaxation of a Markov random field (MRF), as in \citet{zhang12}. For more information, see the \code{dcrmrf} and \code{rcrmrf} functions.

Models with big data sets, too big for memory, may use the SGLD algorithm, the \code{BigData} function, or opt for alternative methods suggested in the details of the documentation for the \code{BigData} function.

Regardless of the model or algorithm, parallel chains are recommended in general, provided the user has multiple CPUs and enough random-access memory (RAM). However, it is best to begin with a single chain, until the user is confident in the model specification. Parallel chains produce more posterior samples upon convergence than single chains in roughly the same amount of time, and may facilitate the discovery of multimodal marginal posterior distributions that would otherwise have been overlooked. Although algorithms may update independently in parallel, there are several that learn from other parallel updates, such as AIES and INCA.

The \code{Demonic Suggestion} section of output from the \code{Consort} function also attempts to help the user to select a sampler. There are exceptions to each of these suggestions above. In some cases, a particular algorithm will fail to update for a given example. Hopefully this section assists the user in selecting a sampler.

\subsubsection{Afterward} \label{afterward}

Once the model is updated with the \code{LaplacesDemon} function, the \code{BMK.Diagnostic} function is applied to 10 batches of the thinned samples to assess stationarity (or lack of trend). When all parameters are estimated as stationary beyond a given iteration, the previous iterations are suggested to be considered as burn-in and discarded.

The importance of Monte Carlo Standard Error (MCSE) is debated \citep{gelman04,jones06}. It is included in posterior summaries of \code{LaplacesDemon}, and is one of five main criteria as a stopping rule to appease Laplace's Demon. MCSE has been shown to be a better stopping rule than MCMC diagnostics \citep{jones06}. \pkg{LaplacesDemon} provides a \code{MCSE} function that allows three methods of estimation: sample variance, batch means \citep{jones06}, and Geyer's method \citep{geyer92}.

The user is encouraged to explore MCMC diagnostics (also called convergence diagnostics). The \pkg{LaplacesDemon} package offers \code{AcceptanceRate}, the \code{BMK.Diagnostic}, a Cumulative Sample Function (\code{CSF}), Effective Sample Size (\code{ESS}), \code{Gelfand.Diagnostic}, \code{Gelman.Diagnostic}, \code{Geweke.Diagnostic}, \code{Heidelberger.Diagnostic}, Integrated Autocorrelation Time (\code{IAT}), the Kolmogorov-Smirnov test (\code{KS.Diagnostic}), Monte Carlo Standard Error (\code{MCSE}), \code{Raftery.Diagnostic}, and both the \code{plot} and \code{PosteriorChecks} functions include multiple diagnostics.

\subsection{Variational Bayes} \label{variationalbayes}

Variational Bayes (VB) is a family of numerical approximation algorithms that is a subset of variational inference algorithms, or variational methods. Some examples of variational methods include the mean-field approximation, loopy belief propagation, tree-reweighted belief propagation, and expectation propagation (EP). Variational inference for probabilistic models was introduced in the field of machine learning, influenced by statistical physics literature.

A VB algorithm deterministically estimates the marginal posterior distributions (target distributions) in a Bayesian model with approximated distributions by minimizing the Kullback-Leibler Divergence (\code{KLD}) between the target and its approximation. The complicated posterior distribution is approximated with a simpler distribution. The simpler, approximated distribution is called the variational approximation, or approximation distribution, of the posterior. The term variational is derived from the calculus of variations, and regards optimization algorithms that select the best function (which is a distribution in VB), rather than merely selecting the best parameters.

VB algorithms often use Gaussian distributions as approximating distributions. In this case, both the mean and variance of the parameters are estimated.

Usually, a VB algorithm is slower to convergence than a Laplace Approximation algorithm, and faster to convergence than a Monte Carlo algorithm such as Markov chain Monte Carlo (MCMC). VB often provides solutions with comparable accuracy to MCMC in less time. Though Monte Carlo algorithms provide a numerical approximation to the exact posterior using a set of samples, VB provides a locally-optimal, exact analytical solution to an approximation of the posterior. VB is often more applicable than MCMC to big data or large-dimensional models.

Since VB is deterministic, it is asymptotic and subject to the same limitations with respect to sample size as Laplace Approximation. However, VB estimates more parameters than Laplace Approximation, such as when Laplace Approximation optimizes the posterior mode of a Gaussian distribution, while VB optimizes both the Gaussian mean and variance.

Traditionally, VB algorithms required customized equations. The \code{VariationalBayes} function uses general-purpose algorithms. A general-purpose VB algorithm is less efficient than an algorithm custom designed for the model form. However, a general-purpose algorithm is applied consistently and easily to numerous model forms.

\subsubsection{Salimans2}

The \code{Salimans2} algorithm is the second algorithm of \citet{salimans13} is used. This requires the gradient and Hessian, which is more efficient with a small number of parameters as long as the posterior is twice differentiable. The step size is constant. This algorithm is suitable for marginal posterior distributions that are Gaussian and unimodal. A stochastic approximation algorithm is used in the context of fixed-form VB, inspired by considering fixed-form VB to be equivalent to performing a linear regression with the sufficient statistics of the approximation as independent variables and the unnormalized logarithm of the joint posterior density as the dependent variable. The number of requested iterations should be large, since the step-size decreases for larger requested iterations, and a small step-size will eventually converge. A large number of requested iterations results in a smaller step-size and better convergence properties, so hope for early convergence. However convergence is checked only in the last half of the iterations after the algorithm begins to average the mean and variance from the samples of the stochastic approximation. The history of stochastic samples is returned.

\section{Bayesian-Inference.com} \label{bayesianinferencecom}

Many additional resources may be found at \url{https://web.archive.org/web/20141224051720/http://www.bayesian-inference.com/index}, as well as other places online:

\begin{itemize}
%\item A Bayesian forum is available at \url{http://www.bayesian-inference.com/forum} to discuss all things Bayesian, including \pkg{LaplacesDemon}.
\item Bayesian information is being compiled under \url{https://web.archive.org/web/20150206004608/http://www.bayesian-inference.com/bayesian}.
%\item Bayesian news is aggregated daily as ``The Bayesian Bulletin'': \url{http://www.bayesian-inference.com/newsbayesian}.
%\item Consulting services are available here: \url{http://www.bayesian-inference.com/consulting}.
\item C++ examples of model functions: \url{https://web.archive.org/web/20140513065103/http://www.bayesian-inference.com/cpp/LaplacesDemonExamples.txt}.
\item MCMC algorithms are described at \url{https://web.archive.org/web/20150430054005/http://www.bayesian-inference.com/mcmc}.
\item Merchandise may be found at \url{http://www.zazzle.com/statisticat}, such as \pkg{LaplacesDemon} t-shirts, coffee mugs, and more.
\item \pkg{LaplacesDemon} and \pkg{LaplacesDemonCpp} development is public and occurs at \url{https://github.com/LaplacesDemonR/LaplacesDemon} and \url{https://github.com/LaplacesDemonR/LaplacesDemonCpp}, respectively.
%\item \pkg{LaplacesDemon} screencasts are available at \url{http://www.bayesian-inference.com/softwarescreencasts}.
%\item \pkg{LaplacesDemon} updates are announced at \url{https://plus.google.com/+Bayesian-inference}.
%\item Opinion polls for Bayesian inference and \pkg{LaplacesDemon} are here: \url{http://www.bayesian-inference.com/polls}.
%\item Technical support services are available at \url{http://www.bayesian-inference.com/support}.
\item And, the home of \pkg{LaplacesDemon} is \url{https://web.archive.org/web/20150430054143/http://www.bayesian-inference.com/software}.
\end{itemize}

\section{Conclusion} \label{conclusion}

The \pkg{LaplacesDemon} package is a significant contribution toward Bayesian inference in \proglang{R}. In turn, contributions toward the development of \pkg{LaplacesDemon} are welcome. Please visit \url{https://github.com/LaplacesDemonR} to contribute to development or report software bugs by opening an issue.

\bibliography{References}

\end{document}