---
title: "LaMa and RTMB"
output: rmarkdown::html_vignette
# output: pdf_document
vignette: >
  %\VignetteIndexEntry{LaMa_and_RTMB}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  out.width = "85%",
  fig.align = "center",
  error = TRUE
)
```

> Before diving into this vignette, we recommend reading the vignettes **Introduction to LaMa**, **Inhomogeneous HMMs** and **Periodic HMMs**.

The recently introduced `R` package `RTMB` conveniently allows for automatic differentiation for non-standard statistical models written in plain `R` code. This enables the estimation of very complicated models, potentially with complex random effect structures. The process feels like magic because you have access to analytic gradients -- *drastically* increasing accuracy and speed -- without doing any calculations!

`LaMa` is now also fully compatible with AD provided by `RTMB`. Hence, estimation of latent Markov models is now faster and more convenient, while model specification is very smooth and less prone to errors -- which at the current state tend to happen when one is not experienced with `RTMB`.

Here we demonstrate how to use `LaMa` and `RTMB` to fit hidden Markov models and their extensions. We always start by loading both packages.

```{r setup}
library(LaMa) # development version
library(RTMB)
```

For the purpose of this vignette, we will analyse the `trex` data set contained in the package. It contains hourly step lengths of a Tyrannosaurus rex, living 66 million years ago, and we aim to understand its behavoural process using HMMs.
```{r data}
head(trex, 5)
```

### Basic workflow

The workflow with `RTMB` is basically always the same. We need to 

* define the negative log-likelihood function, 
* create an automatically differentiable objective function from it and 
* fit the model by numerical minimisation of the latter. 

`RTMB` also provides many functions that make this process very convenient.

### Simple HMM

We start by fitting a super simple stationary HMM with state-dependent gamma distributions for the step lengths and von Mises distributions for the turning angles. As a first step, we define the initial parameter list `par` and a `dat` list that contains the data and potential hyperparameters -- here $N$, the number of hidden states. The names `par` and `dat` are of course arbitrary.

```{r parameters}
par = list(
  logmu = log(c(0.3, 1)),      # initial means for step length (log-transformed)
  logsigma = log(c(0.2, 0.7)), # initial sds for step length (log-transformed)
  logkappa = log(c(0.2, 0.7)), # initial concentration for turning angle (log-transformed)
  eta = rep(-2, 2)             # initial t.p.m. parameters (on logit scale)
  )    

dat = list(
  step = trex$step,   # hourly step lengths
  angle = trex$angle, # hourly turning angles
  N = 2
  )
```

As `par` is a named list of initial parameter values, accessing the parameters later on is much more convenient than indexing. You can also use a parameter vector with `RTMB`, but using a named list makes our life so much easier.

We can now define the negative log-likelihood function in a similar fashion to basic numerical ML

```{r mllk}
nll = function(par) {
  getAll(par, dat) # makes everything contained available without $
  Gamma = tpm(eta) # computes transition probability matrix from unconstrained eta
  delta = stationary(Gamma) # computes stationary distribution
  # exponentiating because all parameters strictly positive
  mu = exp(logmu)
  sigma = exp(logsigma)
  kappa = exp(logkappa)
  # reporting statements for later use
  REPORT(mu); ADREPORT(mu)
  REPORT(sigma); ADREPORT(sigma)
  REPORT(kappa); ADREPORT(kappa)
  # calculating all state-dependent densities
  allprobs = matrix(1, nrow = length(step), ncol = N)
  ind = which(!is.na(step) & !is.na(angle)) # only for non-NA obs.
  for(j in 1:N){
    allprobs[ind,j] = dgamma2(step[ind],mu[j],sigma[j])*dvm(angle[ind],0,kappa[j])
  }
  -forward(delta, Gamma, allprobs) # simple forward algorithm
}
```

but a few points should be made here:

-   Most prominently, the negative log-likelihood is a function of the parameters to be estimated *only* while data and other parameters are not passed as an argument at this stage. This is something to get used to (I know), but just the way `RTMB` works.
-   The `getAll()` function is very useful and you should use it in the first line to unpack both the `par` and the `dat` list, making all elements available without the `$` operator. At this stage, `nll` just takes the dat object from the global environment.
-   Parameter transformations are of course still necessary, i.e. all parameters in `par` should be unconstrained.
-   You might wonder how on earth `RTMB` can calculate the gradient of parameters in distributions like the gamma or von Mises distribution. The answer is: It can't but provides its own version of all standard distributions like `dnorm()`, `dbinom()`, etc. In this case both `dgamma2()` and `dvm()` come from `LaMa` as these are non-standard, but under the hood build on `RTMB` functions (`dgamma2()` is actually just a convenience function that reparametrises the gamma distribution in terms of mean and standard deviation).
- Actually, most of the standard functions (e.g. `sum()`), operators (e.g. `%*%`) and methods (e.g. `matrix`) are "overwritten" when called inside `MakeADFun()` but you typically don't notice that and should not care.
-   the `REPORT()` function offered by `RTMB` is really convenient as any quantities calculated in the likelihood function (for which you have written the code anyway), if reported, will be available after optimisation, while the report statements are ignored during optimisation. So no annoying backtransformations anymore, wohoo!
- for simple parameter transformations, `ADREPORT()` is also great, because it calculates standard deviations for `ADREPORT()`ed quantities, based on the delta method. Just note that the delta method is not advisable for complex non-linear and multivariate transformations.

Having defined the negative log-likelihood, we can now create the autmatically differentiable objective function and fit the model. This needs a little explanation: At this point, `RTMB` takes the negative log-likelihood function and generates its own (very fast) version of it, including a gradient. `MakeADFun()` now also grabs whatever is saved as `dat` in the global environment and *bakes* it into the objective function. Therefore, changes to `dat` after this point will have no effect on the optimisation result. We set `silent = TRUE` to suppress printing of the optimisation process.

```{r ADfunction}
obj = MakeADFun(nll, par, silent = TRUE) # creating the objective function
```

Let's check out `obj`:
```{r tmbobject}
names(obj)
```
It contains the initial parameter `par` (now tranformed to a named vector), the objective function `fn` (which in this case just evaluates `nll` but faster), its gradient `gr` and Hessian `he`.

If we now call these functions without any argument, we get the corresponding values at the initial parameter vector.
```{r tmbobject2}
obj$par
obj$fn()
obj$gr()
```

We are now ready to optimise the objective function. The optimisation routine `nlminb()` is very robust and conveniently allows us to provide a gradient function. Alternatively, you can also use `optim()` or any other optimiser you like that allows you to pass a gradient function. 

Indeed, we do not provide the Hessian to `nlminb()` because, while evaluating the Hessian is very fast with `RTMB`, optimisation is still much faster if we use a quasi-Newton algorithm that approximates the current Hessian based on previous gradient evaluations, compared to using full Newton-Raphson.

```{r modelfit}
opt = nlminb(obj$par, obj$fn, obj$gr) # optimization
```

We can check out the estimated parameter and function value by
```{r optpar}
opt$par
opt$objective
```
Note that the naming here is determined by `nlminb()`. If you use a different optimiser, these may be called differently.

Much nicer however, is that `obj` (yes `obj` not `opt`) is automatically updated after the optimisation. Note that calling `obj$gr()` after optimisation now gives the gradient at the optimum, while `obj$fn()` still gives the objective at the starting value and `obj$par` is not updated but still the initial parameter vector (kind of confusing).

To get our estimated parameters on their natural scale, we don't have to do the backtransformation manually. We can just run the reporting:

```{r MLe}
mod = obj$report() # runs the reporting from the negative log-likelihood once
(delta = mod$delta)
(Gamma = mod$Gamma)
(mu = mod$mu)
(sigma = mod$sigma)
(kappa = mod$kappa)
```

which works because of the `REPORT()` statements in the likelihood function. Note that `delta`, `Gamma` and `allprobs` are always reported by default when using `forward()` which is very useful for e.g. state decoding with `viterbi()`, because many downstream `LaMa` functions take these arguments as inputs. Functions of the `viterbi` and `stateprobs` family can also take the reported list object as an input. As the state-dependent parameters depend on the specific model formulation, these need to be reported manually by the user specifying the negative log-likelihood. Having all the parameters, we can plot the decoded time series

```{r decoding, fig.width = 7, fig.height = 4}
# manually
mod$states = viterbi(mod$delta, mod$Gamma, mod$allprobs)

# or simpler
mod$states = viterbi(mod = mod)

# defining color vector
color = c("orange", "deepskyblue")

plot(trex$step[1:200], type = "h", xlab = "time", ylab = "step length", 
     col = color[mod$states[1:200]], bty = "n")
legend("topright", col = color, lwd = 1, legend = c("state 1", "state 2"), bty = "n")
```

or the estimated state-dependent distributions.

```{r statedepdist, fig.width = 8, fig.height = 4}
oldpar = par(mfrow = c(1,2))
hist(trex$step, prob = TRUE, breaks = 40, 
     bor = "white", main = "", xlab = "step length")
for(j in 1:2) curve(delta[j] * dgamma2(x, mu[j], sigma[j]), 
                    lwd = 2, add = T, col = color[j])
curve(delta[1]*dgamma2(x, mu[1], sigma[1]) + delta[2]*dgamma2(x, mu[2], sigma[2]), 
      lwd = 2, lty = 2, add = T)
legend("top", lwd = 2, col = color, legend = c("state 1", "state 2"), bty = "n")

hist(trex$angle, prob = TRUE, breaks = 40, 
     bor = "white", main = "", xlab = "turning angle")
for(j in 1:2) curve(delta[j] * dvm(x, 0, kappa[j]), 
                    lwd = 2, add = T, col = color[j])
curve(delta[1]*dvm(x, 0, kappa[1]) + delta[2]*dvm(x, 0, kappa[2]), 
      lwd = 2, lty = 2, add = T)
par(oldpar) # resetting to default
```

Moreover, we can also use the `sdreport()` function to directly give us standard errors for our unconstrained parameters and everything we `ADREPORT()`ed. 
```{r sdreport}
sdr = sdreport(obj)
```

We can then get an overview of the estimated parameters and `ADREPORT()`ed quantities as well as their standard errors by
```{r sdreport2}
summary(sdr)
```
To get the estimated parameters or their standard errors in list format, type
```{r sdreport3, eval = F}
# estimated parameter in list format
as.list(sdr, "Estimate")
# parameter standard errors in list format
as.list(sdr, "Std")
```
and to get the estimates and standard errors for `ADREPORT()`ed quantities in list format, type
```{r sdreport4, eval = F}
# adreported parameters as list
as.list(sdr, "Estimate", report = TRUE)
# their standard errors
as.list(sdr, "Std", report = TRUE)
```


Lastly, the automatic reporting with `LaMa` and `RTMB` together makes calculating pseudo-residuals really convenient:

```{r pres, fig.width = 8, fig.height = 4}
pres_step = pseudo_res(trex$step, "gamma2", list(mean = mu, sd = sigma), mod = mod)
pres_angle = pseudo_res(trex$angle, "vm", list(mu = 0, kappa = kappa), mod = mod)

oldpar = par(mfrow = c(1,2))
hist(pres_step, prob = TRUE, breaks = 40, 
     bor = "white", main = "pseudo-residuals", xlab = "step length")
curve(dnorm(x, 0, 1), lwd = 2, add = T, lty = 2)
hist(pres_angle, prob = TRUE, breaks = 40, 
     bor = "white", main = "pseudo-residuals", xlab = "turning angle")
curve(dnorm(x, 0, 1), lwd = 2, add = T, lty = 2)
par(oldpar)
```


### Covariate effects

We can now generalise the previous model to include covariate effects. In our example, we might be interested how the T-rex's behaviour varies with the time of day. Hence, we add diel variation to the state process. For example, we can model the transition probabilities as a function of the time of day using a trigonometric basis expansion to ensure diurnal continuity. The transition probabilities are given by
$$
\text{logit}(\gamma_{ij}^{(t)}) = \beta_0^{(ij)} + \beta_1^{(ij)} \sin \bigl(\frac{2 \pi t}{24}\bigr) + \beta_2^{(ij)} \cos \bigl(\frac{2 \pi t}{24}\bigr) + \beta_3^{(ij)} \sin \bigl(\frac{2 \pi t}{12}\bigr) + \beta_4^{(ij)} \cos \bigl(\frac{2 \pi t}{12}\bigr),
$$
where $t$ is the time of day.

To practically achieve this, we compute the trigonometric basis design matrix `Z` corresponding to above predictor and add the time of day to the `dat` list for indexing inside the likelihood function. The `LaMa` function `trigBasisExp()` does this very conveniently.

```{r tod}
# building trigonometric basis desing matrix (in this case no intercept column)
Z = trigBasisExp(1:24, degree = 2) # convenience function from LaMa
# only compute the 24 unique values and index later for entire time series
dat$Z = Z # adding design matrix to dat
dat$tod = trex$tod # adding time of day to dat for indexing
```

We also need to change the parameter list `par` to include the regression parameters for the time of day. The regression parameters for the state process will typically have the form of a $N (N-1) \times p+1$ matrix, where $N$ is the number of states and $p$ is the number of regressors -- this format is also expected by `tpm_g()` which computes the array of transition matrices based on the design and parameter matrix. Another lovely convenience that `RTMB` allows for is that, in our parameter list, we can have matrices, making reshaping of vectors to matrices inside the likelihood function unnessesary.

```{r todpar}
par = list(logmu = log(c(0.3, 1)), 
           logsigma = log(c(0.2, 0.7)),
           logkappa = log(c(0.2, 0.7)),
           beta = matrix(c(rep(-2, 2), 
                           rep(0, 2*ncol(Z))), nrow = 2)) # 2 times 4+1 matrix
# replacing eta with regression parameter matrix, initializing slopes at zero
```

We can now define a more general likelihood function with the main difference being the use of `tpm_g()` instead of `tpm()` and the inclusion of the time of day in the transition matrix calculation. This leads to us using `stationary_p()` instead of `stationary()` to calculate the initial distribuion and `forward_g()` instead of `forward()` to calculate the log-likelihood.

```{r mllk2}
nll2 = function(par) {
  getAll(par, dat) # makes everything contained available without $
  Gamma = tpm_g(Z, beta) # covariate-dependent tpms (in this case only 24 unique)
  # tpm_g() automatically checks if intercept column is included
  ADREPORT(Gamma) # adreporting
  Delta = stationary_p(Gamma) # periodically stationary distribution
  ADREPORT(Delta)
  delta = Delta[tod[1],] # initial periodically stationary distribution
  # exponentiating because all parameters strictly positive
  mu = exp(logmu); REPORT(mu)
  sigma = exp(logsigma); REPORT(sigma)
  kappa = exp(logkappa); REPORT(kappa)
  # calculating all state-dependent densities
  allprobs = matrix(1, nrow = length(step), ncol = N)
  ind = which(!is.na(step) & !is.na(angle)) # only for non-NA obs.
  for(j in 1:N){
    allprobs[ind,j] = dgamma2(step[ind],mu[j],sigma[j])*dvm(angle[ind],0,kappa[j])
  }
  -forward_g(delta, Gamma[,,tod], allprobs) # indexing 24 unique tpms by tod in data
}
```

Having done this, the model fit is then essentially the same:

```{r modelfit2}
obj2 = MakeADFun(nll2, par, silent = TRUE) # creating the objective function
opt2 = nlminb(obj2$par, obj2$fn, obj2$gr) # optimization
```

and we can look at the reported results. In this case, for simplicity I get standard errors for `Gamma` with the delta method while, in general, this is not advisable.

```{r MLE2, fig.width = 8, fig.height = 5}
mod2 = obj2$report()

sdr = sdreport(obj2)
Gamma = as.list(sdr, "Estimate", report = TRUE)$Gamma
Gammasd = as.list(sdr, "Std", report = TRUE)$Gamma

Delta = as.list(sdr, "Estimate", report = TRUE)$Delta
Deltasd = as.list(sdr, "Std", report = TRUE)$Delta

tod_seq = seq(0, 24, length = 200) # sequence for plotting
Z_pred = trigBasisExp(tod_seq, degree = 2) # design matrix for prediction

Gamma_plot = tpm_g(Z_pred, mod2$beta) # interpolating transition probs

plot(tod_seq, Gamma_plot[1,2,], type = "l", lwd = 2, ylim = c(0,1),
     xlab = "time of day", ylab = "transition probability", bty = "n")
segments(x0 = 1:24, y0 = Gamma[1,2,]-1.96*Gammasd[1,2,], 
         y1 = Gamma[1,2,]+1.96*Gammasd[1,2,])
segments(x0 = 1:24, y0 = Gamma[2,1,]-1.96*Gammasd[2,1,], 
         y1 = Gamma[2,1,]+1.96*Gammasd[2,1,])
lines(tod_seq, Gamma_plot[2,1,], lwd = 2, lty = 3)
legend("topleft", lwd = 2, lty = c(1,3), bty = "n",
       legend = c(expression(gamma[12]^(t)), expression(gamma[21]^(t))))
plot(Delta[,2], type = "b", lwd = 2, xlab = "time of day", ylab = "Pr(active)", 
     col = "deepskyblue", bty = "n", xaxt = "n")
segments(x0 = 1:24, y0 = Delta[,2]-1.96*Deltasd[,2], lwd = 2,
         y1 = Delta[,2]+1.96*Deltasd[,2], col = "deepskyblue")
axis(1, at = seq(0,24,by=4), labels = seq(0,24,by=4))
```

<!-- ### Penalized splines -->

<!-- We can go one step further and model the transition probabilities as smooth functions of the time of day using cyclic P-splines, i.e. -->
<!-- $$ -->
<!-- \text{logit}(\gamma_{ij}^{(t)}) = \beta_0^{(ij)} + s_{ij}(t), -->
<!-- $$ -->
<!-- where $s_{ij}(t)$ is a smooth periodic function of time of day. `LaMa` provides the function `make_matrices()` which creates design and penalty matrices based on the R package `mgcv` when provided with a formula and data. Hence, we can use standard `mgcv` syntax to create the matrices for cyclic P-splines (`cp`). We then append both to the `dat` list. -->

<!-- ```{r tod2} -->
<!-- modmat = make_matrices(~ s(tod, bs = "cp"),  -->
<!--                        data = data.frame(tod = 1:24), -->
<!--                        knots = list(tod = c(0,24))) # where to wrap the cyclic basis -->
<!-- Z = modmat$Z # spline design matrix -->
<!-- S = modmat$S # penalty matrix -->
<!-- dat$Z = Z -->
<!-- dat$S = S[[1]] # mgcv returns a list of penalty matrices (even if only one smooth) -->
<!-- ``` -->

<!-- We have to change our likelihood function slightly by adding the penalization. For this we use the `penalty()` function contained in `LaMa` that computes the sum of quadratic form penalties (the standard penalty used for penalized splines) based on the penalty matrices, the parameters to be estimated and the penalty strength parameters. -->

<!-- Importantly, we now have to separate the non-penalized intercept `beta0` from the penalized spline coefficients now called `betaspline`. The latter, we again conveniently initialize as a matrix, each row representing the coefficient vector for one off-diagonal element of the t.p.m. -->

<!-- ```{r mllk3} -->
<!-- pnll = function(par) { -->
<!--   getAll(par, dat) # makes everything contained available without $ -->
<!--   Gamma = tpm_g(Z, cbind(beta0, betaspline)); ADREPORT(Gamma) -->
<!--   Delta = stationary_p(Gamma); ADREPORT(Delta) -->
<!--   delta = Delta[tod[1],] -->
<!--   # exponentiating because all parameters strictly positive -->
<!--   mu = exp(logmu); REPORT(mu) -->
<!--   sigma = exp(logsigma); REPORT(sigma) -->
<!--   kappa = exp(logkappa); REPORT(kappa) -->
<!--   # calculating all state-dependent densities -->
<!--   allprobs = matrix(1, nrow = length(step), ncol = N) -->
<!--   ind = which(!is.na(step) & !is.na(angle)) # only for non-NA obs. -->
<!--   for(j in 1:N){ -->
<!--     allprobs[ind,j] = dgamma2(step[ind],mu[j],sigma[j])*dvm(angle[ind],0,kappa[j]) -->
<!--   } -->
<!--   -forward_g(delta, Gamma[,,tod], allprobs) + -->
<!--     penalty(betaspline, S, lambda) # this does all the penalization work -->
<!-- } -->
<!-- ``` -->

<!-- We also have to append a `lambda` argument to our `dat` list, which is the initial penalty strength parameter vector. In this case of length two because our coefficient matrix has two rows. -->

<!-- ```{r todpar2} -->
<!-- par = list(logmu = log(c(0.3, 2.5)),  -->
<!--            logsigma = log(c(0.2, 1.5)), -->
<!--            logkappa = log(c(0.2, 1.5)), -->
<!--            beta0 = c(-2, 2), # intercept now separated! -->
<!--            betaspline = matrix(rep(0, 2*(ncol(Z)-1)), nrow = 2)) -->

<!-- dat$lambda = rep(100, 2) # adding initial penalty strength to the dat list -->
<!-- ``` -->

<!-- The model fit can then be conducted by using the `qreml()` function contained in `LaMa`. **qREML** stands for **quasi restricted maximum likelihood** and finds a good penalty strength by treating the spline coefficients as random effects. Under the hood, `qreml()` also constructs an AD function with `RTMB` but uses the **qREML** algorithm described in Koslik (2024) to fit the model. We have to tell the `qreml()` function which parameters are spline coefficients by providing the name of the corresponding list element of `par`. -->

<!-- There are some rules to follow when using `qreml()`: -->

<!-- 1. The likelihood function needs to be `RTMB`-compatible, i.e. have the same structure as all the likelihood functions in our vignette -- most importantly, it should only be a function of the parameter list. -->
<!-- 3. The penalty strength vector `lambda` needs its length to correspond to the *total* number of spline coefficient vectors used. In our case, this is the number of rows of betaspline, but if we additionally had a different spline coefficient in our parameter list (that may have a different length and a different penalty matrix), we would have needed more elements in `lambda`. -->
<!-- 4. The `penalty()` function can only be called *once* in the likelihood. If several spline coefficients are penalized, `penalty()` expects a list of coefficient matrices or vectors and a list of penalty matrices. -->
<!-- 5. When we summarise multiple spline coefficients in a matrix in our parameter list -- which is very useful when these are of same lengths and have the same penalty matrix -- this matrix must be arranged by row, i.e. each row is one spline coefficient vector. If it is arranged by column, `qreml()` will fail. -->

<!-- ```{r qreml, message = FALSE} -->
<!-- system.time( -->
<!--   mod3 <- qreml(pnll, par, dat, random = "betaspline") -->
<!-- ) -->
<!-- ``` -->
<!-- The `mod` object is now a list that contains everything that is reported by the likelihood function, but also the `RTMB` object created in the process. After fitting the model, we can also use the `LaMa` function `pred_matrix()`, that takes the `modmat` object we created earlier, to build a new interpolating design matrix using the exact same basis expansion specified above. This allows us to plot the estimated transition probabilities as a smooth function of time of day -- I now ignore confidence bands due to laziness. -->

<!-- ```{r results qreml, fig.width = 8, fig.height = 5} -->
<!-- sdr = sdreport(mod3$obj) -->
<!-- Gamma = as.list(sdr, "Estimate", report = TRUE)$Gamma -->
<!-- Delta = as.list(sdr, "Estimate", report = TRUE)$Delta -->

<!-- tod_seq = seq(0,24, length=200) -->
<!-- Z_pred = pred_matrix(modmat, data.frame(tod = tod_seq)) -->

<!-- Gamma_plot = tpm_g(Z_pred, mod3$beta) # interpolating transition probs -->

<!-- plot(tod_seq, Gamma_plot[1,2,], type = "l", lwd = 2, ylim = c(0,1), -->
<!--      xlab = "time of day", ylab = "transition probability", bty = "n") -->
<!-- lines(tod_seq, Gamma_plot[2,1,], lwd = 2, lty = 3) -->
<!-- legend("topleft", lwd = 2, lty = c(1,3), bty = "n", -->
<!--        legend = c(expression(gamma[12]^(t)), expression(gamma[21]^(t)))) -->
<!-- plot(Delta[,2], type = "b", lwd = 2, xlab = "time of day", ylab = "Pr(active)",  -->
<!--      col = "deepskyblue", bty = "n", xaxt = "n") -->
<!-- axis(1, at = seq(0,24,by=4), labels = seq(0,24,by=4)) -->

<!-- ``` -->

<!-- We see that by allowing for a more flexible relationship, the estimated time of day effect becomes stronger with even sharper peaks than we would have concluded using the trigonometric approach. -->

<!-- ### Full Laplace method -->

<!-- Lastly, we could have achieved a similar fit as above using the slightly more accurate full Laplace approximation method, which can be used to fit models via marginal maximum likelihood estimation by integrating out the random effects. This is natively supported by `RTMB` -- and actually one of its core selling points -- and the standard way we can now deal with all kinds of random effects.  -->

<!-- Indeed, the **qREML** algorithm above treats the spline coefficients as Gaussian random effects but exploits their relatively simple structure yiedling a more efficient fitting method. The full Laplace method is much more general, allowing for very flexible random effects, but here, estimation slower because it does not exploit the simple structure of splines treated as random effects. -->

<!-- We have to alter our likelihood function slightly, because for the Laplace method, we need to implement the joint likelihood of the data and the random effect, the latter having a multivariate normal distribution. Specifically, if $b$ is our random effect for a spline, $b \sim N(0, \lambda^{-1} S^-)$. The likelihood of the data given $b$ (just our regular likelihoot that treats $b$ as a parameter) is $f(x \mid b)$ and the density of $b$ is $f_{\lambda}(b)$. Hence the joint likelihood can be computed as -->
<!-- $$ -->
<!-- f(x, b) = f(x \mid b) f_{\lambda}(b) -->
<!-- $$ -->
<!-- and the joint negative log-likelihood becomes $- \log f(x \mid b) - \log f_{\lambda}(b)$ and this is what we implement below. -->

<!-- Most conveniently this is done by using the `dgmrf2()` function included in `LaMa` which provides the density function of the multivariate normal distribution reparametrized in terms of the (scaled) precision matrix, i.e. inverse covariance matrix, which in our case is $\lambda_i S$ for spline $i$. It allows evaluating at multiple points at once, each one possibly with its own penalty strength parameter `lambda`. It differs from `RTMB`'s `dgmrf()` by not expecting a *sparse* precision matrix and being more robust for rank-deficient penalty matrices, which are typical for penalized splines. -->

<!-- ```{r mllk4} -->
<!-- jnll = function(par) { -->
<!--   getAll(par, dat) # makes everything contained available without $ -->
<!--   Gamma = tpm_g(Z, cbind(beta0, betaspline)); ADREPORT(Gamma) -->
<!--   Delta = stationary_p(Gamma); ADREPORT(Delta) -->
<!--   delta = Delta[tod[1],] -->
<!--   # exponentiating because all parameters strictly positive -->
<!--   mu = exp(logmu); REPORT(mu) -->
<!--   sigma = exp(logsigma); REPORT(sigma) -->
<!--   kappa = exp(logkappa); REPORT(kappa) -->
<!--   # calculating all state-dependent densities -->
<!--   allprobs = matrix(1, nrow = length(step), ncol = N) -->
<!--   ind = which(!is.na(step) & !is.na(angle)) # only for non-NA obs. -->
<!--   for(j in 1:N){ -->
<!--     allprobs[ind,j] = dgamma2(step[ind],mu[j],sigma[j])*dvm(angle[ind],0,kappa[j]) -->
<!--   } -->
<!--   -forward_g(delta, Gamma[,,tod], allprobs) - -->
<!--     sum(dgmrf2(betaspline, 0, S, exp(loglambda), log = TRUE)) # just like any other density in R -->
<!-- } -->
<!-- ``` -->

<!-- We also have to include the log of our penalty strength as a parameter now. -->

<!-- ```{r todpar3} -->
<!-- par$loglambda = log(rep(100, 2)) -->
<!-- ``` -->

<!-- To create the objective function, we need to tell `RTMB` that `betaspline` is a random effect such that it is integrated out and as our objective function we have the marginal likelihood -->
<!-- $$ -->
<!-- f(x) = \int f(x, b) \,db, -->
<!-- $$ -->
<!-- actually its negative log of course. -->

<!-- ```{r refit, message = FALSE, eval = FALSE} -->
<!-- obj4 = MakeADFun(jnll, par, random = "betaspline", silent = TRUE) -->
<!-- system.time( -->
<!--   opt4 <- nlminb(obj4$par, obj4$fn, obj4$gr) -->
<!-- ) -->
<!-- ``` -->

<!-- This more general algorithm takes more than ten times model as long to fit the model. Hence, the above code is not evaluated. The results however are very similar. -->

<!-- ```{r results refit, fig.width = 8, fig.height = 5}
mod4 = obj4$report()
mod4$Gamma = tpm_g(Z, mod4$beta) # calculating 24 tpms

Gamma_plot = tpm_g(Z_pred, mod4$beta)
plot(tod_seq, Gamma_plot[1,2,], type = "l", lwd = 2, ylim = c(0,1),
     xlab = "time of day", ylab = "transition probability", bty = "n")
lines(tod_seq, Gamma_plot[2,1,], lwd = 2, lty = 3)
legend("topleft", lwd = 2, lty = c(1,3), bty = "n",
       legend = c(expression(gamma[12]^(t)), expression(gamma[21]^(t))))
``` -->

### Common issues with `RTMB`

There are some problems with `RTMB` one has to keep in mind. They can be a bit annoying, but in my opinion the benefits of automatic differentiation far outweigh the drawbacks. I list the main ones I have encountered here, but please tell me if you encounter more, such that they can be added.

A typical issue with `RTMB` is that some operators might need to be overloaded to allow for automatic differentiation which cannot be done by default. In typical model setups `LaMa` functions do this themselves, but if you go a very individualistic route and get an error like 
```{r error, eval = FALSE}
stop("Invalid argument to 'advector' (lost class attribute?)")
```
you might have to overload the operator yourself. To do this put
```{r overloading}
"[<-" <- ADoverload("[<-")
```
as the first line of your likelihood function. If the error still prevails also add
```{r overloading2}
"c" <- ADoverload("c")
"diag<-" <- ADoverload("diag<-")
```
which should hopefully fix the error.

Another common problem occurs when initiating objects with `NA` values and then trying to fill them with `numeric` values. This is because `NA` is logical which screws up the automatic differentiation due to the mismatching types. To avoid this, always initiate with `numeric` or `NaN` values. For example, don't do
```{r NA, eval = FALSE}
X = array(dim = c(1,2,3))
# which is the same as
X = array(NA, dim = c(1,2,3))
```
but rather
```{r NaN, eval = FALSE}
X = array(NaN, dim = c(1,2,3))
# or
X = array(0, dim = c(1,2,3))
```
to avoid the error.

Importantly, you cannot use `if` or `max`/ `min` statements **on the parameter itself** as these are not differentiable. If you do so, `RTMB` will fail and probably does not produce a helpful error message. The problem here results from `RTMB` building the *tape* (computational graph) of the function at the initial parameter value. When you have `if` statements, the resulting gradient will be different from the one at a different parameter value. Often, you can remedy this behaviour by exploiting the fact that `abs()` is differentiable (in code). For example, you can create the differentiable `max` alternative:
```{r max2, eval = FALSE}
max2 = function(x,y){
  (x + y + abs(x - y)) / 2
}
```
So you might be able to solve such problems by finding a clever alternative. If the `if` statement does not involve the parameter, it will typically be fine because it does not change during the optimisation.

Furthermore, there are some unfortunate side effects of R's 'byte compiler' (enabled by default in R). So if you encounter an error not matching the previous ones, try disabling the byte compiler with
```{r bytecompiler, message = FALSE}
compiler::enableJIT(0)
```
and see if the error is resolved.

Some more minor things:

* if you're used to `expm::expm()` that won't work with AD. Use `Matrix::expm()` instead.
* `CircStats::dvm()` also isn't compatible with AD. Use `LaMa::dvm()` instead.
* most of the standard distributions are available in `RTMB`. If you need a non-standard one, try implementing the density function yourself using plain R code. 
`RTMB` also provides AD versions of many building-block functions (like the Gamma or Bessel function) which might help with this.

For more information on `RTMB`, check out its [documentation](https://CRAN.R-project.org/package=RTMB) or the [TMB users Google group](https://groups.google.com/g/tmb-users).