---
title: "MRgrowth: Growth Models for Mark-Recapture Data"
author: "Donald T. McKnight"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Getting Started with MRgrowth}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

## Overview

'MRgrowth' provides tools for estimating body-size growth rates from mark-recapture data. It is intended for situations where individuals do not have known ages, a common scenario in field studies of reptiles and other long-lived animals, and it has been designed to be as simple and easy to use as possible. 

'MRgrowth' supports three growth models:

- **von Bertalanffy** (`"vb"`; using the Faben's reformulation)
- **Gompertz** (`"gompertz"`)
- **Logistic** (`"logistic"`)

This vignette walks through a complete workflow using a turtle mark-recapture dataset.

---


## Example data

'MRgrowth' includes a built-in example data set `MRdata`, which contains mark-recapture data for a turtle population. Each row represents a single capture event for an individual.

```{r data}

library(MRgrowth)

data(MRdata)
head(MRdata)
str(MRdata)
```

The data set contains three columns:

- `id` — a unique identifier for each individual
- `date` — the date of the capture event (in Date format; POSIX formats with associated times are also supported)
- `size` — the body size of the individual at capture (the example is in mm, but any units are acceptable)

Some individuals were captured more than twice, and there are some missing values. 'MRgrowth' handles both situations through the `format.data()` function described below.

---

## Step 1: Format the data

Raw mark-recapture data are typically in long format (one row per capture event), but 'MRgrowth' requires a wide format (one row per pair of capture events). The `format.data()` function handles this conversion. Filtering for NAs and individuals that were only captured once is handled internally and you do not need to filter the data beforehand (see below).

```{r format}
formatted.data <- format.data(
  x        = MRdata,
  id.col   = "id",
  date.col = "date",
  size.col = "size",
  units    = "years",
  retain   = "max"
)

head(formatted.data)
```

The output contains one row per pair of capture events, with columns for:

- `id` — individual identifier
- `size1` — size at first capture
- `size2` — size at recapture
- `date1` / `date2` — dates of each capture
- `td` — time elapsed between captures in the units you specified

Only the size1, size2, and td columns are needed for subsequent functions. The date1 and date2 columns are provided in case users need to match this output with other aspects of their data.

**Note:** Sizes can be in any units you like, and those units will be carried throughout all subsequent functions. Thus, if you enter your data in mm, all results are for mm. Likewise, whatever time unit you set in `format.data()` will be the time unit for all subsequent functions and results.

### Handling multiple recaptures

The `retain` argument controls what happens when an individual was captured more than twice. As a general rule, individuals should only be included once to avoid pseudoreplication unless you are running a model with an error structure that handles replication per individual (those models are not currently supported in 'MRgrowth')

- `"max"` (default) — uses the first and last capture, maximizing the time interval and therefore the information available for estimating growth. This is recommended in most situations.
- `"random"` — randomly selects two captures. Use the `seed` argument for reproducibility.
- `"all"` — retains all consecutive pairs of captures (e.g., captures 1&2, 2&3, 3&4). Note that this introduces **pseudoreplication** because measurements from the same individual are not independent, and should be used with caution.

### NAs and data filtering

Any capture events with missing data are removed before the `retain` argument is applied. This means that if an individual was captured three times but had missing size data at the first capture, the function will use the second and third captures even if `retain = "max"`.

Additional, any individuals that were only measured once are filtered out of the data set. 

---

## Step 2: Fit growth models

The `calc.growth()` function is the primary function in 'MRgrowth'. It fits one or more growth models to the formatted data, returns the fitted values of asymptotic size (A) and growth rate (k), calculates bootstrapped confidence intervals, and optionally produces a plot of growth curves.

### Key arguments

**`size0`** — the body size at birth or hatching. This does not affect the estimates of A or k, but it does affect the predicted sizes at each age in `age.vect`. Use a species-typical birth or hatchling size. If you do not provide a value, the function will use half the smallest observed size, which may be unrealistic.

**`age.vect`** — a vector of ages for which predicted sizes and confidence intervals will be calculated. Choose a range that covers the full lifespan of your species. If not provided, the function defaults to ages 1 through 4 times the largest observed time interval, which may not cover the full age range. The number of values in this vector has very little impact on run time, so it is best to use fine-scale intervals (e.g., every year) to produce smooth curves.

**`models`** — a vector of models to run and evaluate (any combination of "vb","logistic","gompertz"). The default is to run and compare all three.

**`runs`** — 95% confidence intervals are calculated by bootstrapping, and the runs argument determines the number of bootstrap replicates (higher values = better estimates but longer run times). The default of 5000 is generally appropriate for publication. Lower values (e.g., 100) can be used during exploratory analysis to reduce run time. This argument only affects the confidence intervals, not the the actual estimates of A, k, or model diagnostics (e.g., (AIC). Therefore, for large data sets, you may want to do an initial run with bootstrapping essentially turned off by setting it to 1. Use that run to determine which model is the best fit, then run a second analysis using a high number of runs (e.g., 5000) and only the best fit model. Note that the seed argument is passed internally to `set.seed()`, so results are repeatable.

**`type`** — by default, 'MRgrowth' uses maximum likelihood ("ML") estimation, but it can be changed to least squares ("OLS"). Usually, ML and OLS will produce identical results; however, OLS will not return AIC values. Occasionally, ML may result in convergence issues which are not present in OLS. Therefore, if an ML model does not converge, you may want to try it with OLS.

**`fixed.A`** — in a limited number of cases, you may want to evaluate only the k parameter and supply a fixed asymptotic value (A) based on existing knowledge (e.g., a different study of a different population). As a general rule, this is not advisable, but can be used in some cases such as small sample sizes. Interpreting comparisons of models with and without a fixed A value is complex. AICs are inherently penalized by the number of parameters involved; thus, models with fixed.A have fewer parameters, which will result in lower AICs unless the fit is substantially worse. As a result, if you run models with and without a fixed A, and the fixed values are close to the values that would have been estimated, you will get lower AICs even if the other parameters (k) do not actually fit as well. This is simply because you had one fewer parameter penalizing the AIC score. Further, for those AICs to be valid, the value of A must have been derived entirely independently (e.g., a different population). If it was based on a previous run or some aspect of the population being studied (e.g., max observed body size) then it is not independent and the AIC score has not been accurately penalized for the number of parameters. To help with this, two sets of AIC scores are returned when a value is supplied for fixed.A. The "fixed" scores were calculated based on two parameters being estimated (k and sigma). These are the "correct" AICs for models with a fixed A, but they come with the caveats described above. The "AIC" and "AICc" columns contain the AICs based on three parameters being estimated (A, k, and sigma), and may be more appropriate if the selection of A was in anyway biased. Other model diagnostics like convergence issues and NLL should also be consulted. 

**`other arguments`** — The other inputs are set to reasonable defaults and, in most cases, they do not need to be altered. Indeed, in most cases, only the data set, size0, and age.vect need to be supplied for a useful run (assuming your data is the output of `format.data()`).

Adjusting initial.A and initial.k can result in slight improvements in run time, but they are generally not noticeable. In concept, initial.A and initial.k values could affect convergence in the ML models, but to avoid this `calc.growth()` internally uses the initial values in an OLS model to produce suitable inputs for an ML model, and the final results are based on the ML model. So, in practice, initial values have no effect on the final result.

---

### Basic usage

Let's run a simple comparison of all three models, looking at the expected size at ages 1-50 (years), with a hatchling size of 50 mm. For this example, we will keep runs low (100) to reduce run time, but for actual analyses, you want several thousand runs. Most values will be left on their default settings.

```{r fit_fast}
# Using low runs for vignette build speed
results <- calc.growth(
  x           = formatted.data,
  size0       = 50,
  age.vect    = 1:50,
  models      = c("vb", "gompertz", "logistic"),
  type        = "ML",
  runs        = 100,
  return.plot = TRUE
)
```

## Step 3: Examine the output

`calc.growth()` returns a list with two elements.

### Parameters

```{r params}
results$parameters
```

Each row contains the fitted parameters and diagnostics for one model:

- **`A`** — the estimated asymptotic size. Note that A is an average asymptotic size for the population. Thus, you will probably have some individuals larger than A.
- **`k`** — the growth rate coefficient. Higher values indicate faster growth toward the asymptote.
- **`A.lower` / `A.upper`** and **`k.lower` / `k.upper`** — the 95% bootstrapped confidence intervals for each parameter.
- **`convergence`** — a code from `optim()`. A value of `0` indicates the optimizer converged successfully. Any other value indicates a convergence issue and the results should be treated cautiously.
- **`NLL`** — the negative log-likelihood of the fitted model (only returned if type = "ML" [default]).
- **`SS`** — the sum of squares of the fitted model (only returned if type = "SS").
- **`AIC`** / **`AICc`** — information criteria for model comparison (only returned if type = "ML" [default]; see the fixed.A section in step 2).

### Sizes at ages

```{r sizes}
head(results$sizes.at.ages)
```

This data frame contains the predicted size at each age in `age.vect` for each model, along with the upper and lower bootstrapped 95% confidence intervals. These are the data behind the growth curve plot and are available for more refined plotting in ggplot2 or other graphing software.

---

## Step 4: Compare models

When multiple models are fitted, the AIC and AICc values in `results$parameters` can be used to compare them. Lower values indicate a better fit relative to model complexity.

NLL or SS values can also be useful (lower values = better fits).

Make sure to check the convergence column. Values other than 0 indicate convergence issues and generally should not be trusted even if they have low AICs.

In this case, no models had convergence issues and the 'vb' (von Bertalanffy) appears to be the best fit, so we will use it in all subsequent steps.

```{r}
#look at model diagnostics
print(results$parameters)


# Extract parameters from the best-fitting model
best.model <- results$parameters[which(results$parameters$model=="vb"), ]

```
Additional checks can be performed by using the resulting model parameters and the `age.at.size()` and `size.at.age()` functions (see Step 5) 

---

## Step 5: Additional utility functions

Once you have determined the best fit model, it is a good idea to interrogate it further to see how well it actually fits the data. Simply being the "best" model out of the three we tried does not mean that the model is actually a good fit (it may simply be not as bad as the others). This section first walks introduces some aditional functions, then explains how to use them to interrogate your model.

### Predicted size at a given age

Once you have estimates of A and k, you can use `size.at.age()` to predict the size of an individual at any age.

```{r size_at_age}
#calculate expected sizes for a 10, 20 and 30 year old turtle
#note that the model argument specifies that we are using a vb model

size.at.age(
  A     = best.model$A,
  k     = best.model$k,
  age   = c(10, 20, 30),
  size0 = 50,
  model = "vb")

```

This will give the same results as the $sizes.at.ages output from `calc.growth()`, but you can use it to examine different sizes, different birth/hatching sizes, or slight variations in A and k.  

### Predicted age at a given size

Once you have estimates of A and k, you can use `age.at.size()` to estimate the age of an individual based on its size.

```{r age_at_size}

#Estimate the age (years) of a 150, 250, and 350 mm turtle
#note that the model argument specifies that we are using a vb model

age.at.size(
  A     = best.model$A,
  k     = best.model$k,
  size  = c(150, 250, 350),
  size0 = 50,
  model = "vb")

```

Note that ages cannot be calculated for sizes greater than A; these will return `NaN`.

### Predicted size after a time interval

The three growth functions (`vb()`, `gompertz()`, `logistic()`) can also be used directly to predict the size of an individual after a given time interval, starting from a known size. Note again that all units (including time) should be the same as the initial units used in `format.data()`.

```{r growth_funcs}
# Predict size of a 200 mm turtle after 3 years using the vb model
#note that unlike most functions in 'MRgrowth', each model as a separate function, rather than specifying the model in a single function

vb(
  A     = best.model$A,
  k     = best.model$k,
  size1 = 200,
  td    = 3
)
```

### Further model interrogations

You can use these functions to perform further checks on your model.

#### Comparing observed and predicted sizes at recapture

We can use the size1 and time difference in our formatted input data to predict what size turtles should be at recapture. We can then compare those sizes to the size2 data to see how well the model predicted them.

```{r}

predicted.size2 <- vb(
  A     = best.model$A,
  k     = best.model$k,
  size1 = formatted.data$size1, #use size 1 data
  td    = formatted.data$td #use time difference in the data
)

plot(x = formatted.data$size2,
     y = predicted.size2,
     xlab = "Observed size2",
     ylab = "Estimated size2",
     pch = 16,
     col = "blue",
     bg  = "white",
     panel.first = rect(par("usr")[1], par("usr")[3],
                        par("usr")[2], par("usr")[4],
                        col = "white"))
abline(lm(predicted.size2 ~ formatted.data$size2),
       col = "black",
       lwd = 1)
r2 <- summary(lm(predicted.size2 ~ formatted.data$size2))$r.squared
legend("topleft",
       legend = bquote(R^2 == .(round(r2, 4))),
       bty    = "n")

```

Here you can see that the values match up pretty well with a high R2. So the model did a good job. If the R2 was low, it would indicate that even the "best" model is doing a poor job of fitting this data and should be treated with skepticism.

Another important check is the spacing of the residuals along the line. If, for example, there were large deviations only on the left (small side), it would suggest that the model is doing a good job of estimating growth of large individuals, but struggling with small individuals.

**Note:** Because of the models' math, for `vb()`, `gompertz()`, and `logistic()` any time that size1 is < A, the estimated size will be less than size 1. Therefore, you may want to either filter those out (the best option for R2 values) or replace them with size1 in the output.

```{r}
#remove any individuals where size1 > A
observed.size2.b <- formatted.data$size2[which(formatted.data$size1 < best.model$A)]
predicted.size2.b <- predicted.size2[which(formatted.data$size1 < best.model$A)]

plot(x = observed.size2.b,
     y = predicted.size2.b,
     xlab = "Observed size2",
     ylab = "Estimated size2",
     pch = 16,
     col = "blue",
     bg  = "white",
     panel.first = rect(par("usr")[1], par("usr")[3],
                        par("usr")[2], par("usr")[4],
                        col = "white"))
abline(lm(predicted.size2.b ~ observed.size2.b),
       col = "black",
       lwd = 1)
r2 <- summary(lm(predicted.size2.b ~ observed.size2.b))$r.squared
legend("topleft",
       legend = bquote(R^2 == .(round(r2, 4))),
       bty    = "n")

```

##### Comparing observed and predicted sizes and ages for individuals with known-age data

If you have a subset of individuals with known ages, you can use the `size.at.age()` and `age.at.size()` functions to further check the models.

Let's say, for example, that in addition to the animals used to generate the models, we have 4 individuals of know ages. They are 5, 10, 12, and 20 years old, and are 200, 320, 330, and 400 mm. Let's start by using their ages to predict their sizes. We'll convert the results into percent differences to make them easier to interpret.

```{r}
predicted.size <- size.at.age(A=best.model$A,
                              k=best.model$k,
                              age=c(5,10,12,20),
                              size0=50,
                              model="vb")

#expected sizes given their ages
print(predicted.size)

#percent difference between observed and predicted sizes
print((c(200,320,330,400)-predicted.size)/c(200,320,330,400)*100)
```

Mostly, those size estimates line up pretty well. As a point of comparison, let's do the same but with the logistic model, which did not fit as well.

```{r}
logistic.model <- results$parameters[which(results$parameters$model=="logistic"), ]

predicted.size.logistic <- size.at.age(A=logistic.model$A,
                              k=logistic.model$k,
                              age=c(5,10,12,20),
                              size0=50,
                              model="logistic")

#predicted sizes based on the logistic model
print(predicted.size.logistic)

#percent difference between observed and predicted sizes for the logistic model
print((c(200,320,330,400)-predicted.size.logistic)/c(200,320,330,400)*100)

```

These estimates are much further off, especially for smaller individuals, which makes sense, given that the logistic model did not fit as well.

Now we can invert things and look at the expected ages for those sizes.

```{r}
predicted.age <- age.at.size(A=best.model$A,
                              k=best.model$k,
                              size=c(200,320,330,400),
                              size0=50,
                              model="vb")

print(predicted.age)
```

Again, not too bad, suggesting that that the vb model is doing a good job.

---

## References

Edmonds, D., et al. (2021). Growth and population structure of a long-lived reptile.
*PLOS ONE*. <doi:10.1371/journal.pone.0259978>

von Bertalanffy, L. (1957). Quantitative laws in metabolism and growth.
*The Quarterly Review of Biology*, 32, 217--231.

Fabens, A.J. (1965). Properties and fitting of the von Bertalanffy growth curve.
*Growth*, 29, 265--289.

Gompertz, B. (1833). On the nature of the function expressive of the law of human
mortality, and on a new mode of determining the value of life contingencies.
*Philosophical Transactions of the Royal Society of London*, 123, 252--253.

Schoener, T.W. and Schoener, A. (1978). Estimating and interpreting body-size
growth in some Anolis lizards. *Copeia*, 1978, 390--405.