---
title: "Cross-validation"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Cross-validation}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
### ✘ Definition
Whether independent data is available or not, **data-splitting** methods allow the input data to be divided into different pieces on which the models are **calibrated** and **validated**.
The most common procedures either split the original dataset randomly into two parts (**random**), with the larger proportion used to calibrate the models, or into k datasets of equal size (**k-fold**), each of which is used in turn to validate the model while the remaining ones are used for calibration. For both methods, the splitting can be repeated several times.
Other procedures are available to test for model overfitting and to assess transferability in either geographic or environmental space: the **block** method described in *Muscarella et al. 2014* partitions data into four spatial bins of equal size (bottom-left, bottom-right, top-left and top-right), while the **x-y-stratification** described in *Wenger and Olden 2012* uses k partitions along the x- (or y-) gradient and returns 2k partitions; the **environmental** partitioning returns k partitions for each environmental variable provided.
These methods can be balanced over presences or absences to ensure an equal distribution over space, especially if some data are clumped on an edge of the study area.
The user can also define their own data partitioning (**user.defined**).
### ✠ How you can select methods
`biomod2` allows you to use different strategies to separate your data into a calibration dataset and a validation dataset for the cross-validation. With the argument `CV.strategy` in [`BIOMOD_Modeling`](../reference/BIOMOD_Modeling.html), you can select:
**Random**
The simplest method to calibrate and validate a model is to split the original dataset into two datasets: one to calibrate the model and the other one to validate it. The splitting can be repeated `nb.rep` times, and the proportion of data kept for calibration can be adjusted with `perc` (see the Concretely section below for an example).
**K-fold**
The `k-fold` method splits the original dataset into `k` datasets of equal size: each part is used successively as the validation dataset while the other `k-1` parts are used for calibration, leading to `k` calibration/validation pairs. This multiple splitting can be repeated `nb.rep` times.
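As an illustration, such a partition could be generated directly with [`bm_CrossValidation`](../reference/bm_CrossValidation.html). A minimal sketch, assuming the `myBiomodData` object built in the [main functions vignette](examples_1_mainFunctions.html):
```R
# 3-fold partition repeated twice: returns a TRUE/FALSE matrix with
# 2 x 3 = 6 columns, TRUE marking the rows used for calibration
bm_CrossValidation(bm.format = myBiomodData,
                   strategy = "kfold",
                   nb.rep = 2,
                   k = 3)
```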
**Block**
It may be used to test for model overfitting and to assess transferability in geographic space. The `block` stratification was described in *Muscarella et al. 2014* (see References). The data are partitioned into four spatial bins of equal size (bottom-left, bottom-right, top-left and top-right).
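Since the four spatial bins are entirely determined by the data coordinates, no `k` or `nb.rep` argument is needed. A minimal sketch, assuming `myBiomodData` as above:
```R
# returns a TRUE/FALSE matrix with 4 columns, one per spatial bin
bm_CrossValidation(bm.format = myBiomodData,
                   strategy = "block")
```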
**Stratified**
It may be used to test for model overfitting and to assess transferability in geographic space. The `x` and `y` stratifications were described in *Wenger and Olden 2012* (see References). The `y` stratification uses `k` partitions along the y-gradient, and the `x` stratification does the same along the x-gradient. `both` returns `2k` partitions: `k` partitions stratified along the x-gradient and `k` partitions stratified along the y-gradient.
You can choose between `x`, `y` and `both` stratification with the argument `strat` (see the Concretely section below for an example).
**Environmental**
It may be used to test for model overfitting and to assess transferability in environmental space. It returns `k` partitions for each variable given in `env.var`. You can choose whether the presences or the absences are balanced over the partitions with `balance`.
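For instance, to get two partitions per variable, balanced over presences. A minimal sketch, where `"bio3"` and `"bio4"` are hypothetical variable names that must match columns of your explanatory data:
```R
# returns k = 2 partitions for each variable listed in env.var
bm_CrossValidation(bm.format = myBiomodData,
                   strategy = "env",
                   k = 2,
                   balance = "presences",
                   env.var = c("bio3", "bio4"))
```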
**User-defined**
Allows the user to provide their own cross-validation table. For a presence-absence dataset, column names must be formatted as `_allData_RUNx` with `x` an integer. For a presence-only dataset for which several pseudo-absence datasets were generated, column names must be formatted as `_PAx_RUNy`, with `x` an integer such that `PAx` is an existing pseudo-absence dataset, and `y` an integer.
### ✣ Evaluation
`biomod2` allows you to use a cross-validation method to build (**calibration**) and validate (**validation**) the models, but they can also be tested on another independent dataset if available (**evaluation**). This second independent dataset can be provided through the `eval.resp.var`, `eval.resp.xy` and `eval.expl.var` parameters of the [`BIOMOD_FormatingData`](../reference/BIOMOD_FormatingData.html) function.
**Note that**, while you can have as many *evaluation* values (computed on the independent dataset) as cross-validation splits for the single models, you can have only one *evaluation* value for the ensemble models.
This can be circumvented by using `em.by = 'PA+run'` within the [`BIOMOD_EnsembleModeling`](../reference/BIOMOD_EnsembleModeling.html) function to build, for each cross-validation fold, an ensemble model across algorithms.
You will obtain as many ensemble models as cross-validation splits, and thus as many *evaluation* values. But you will also have several ensemble models, which may defeat your purpose of having a single final model.
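A minimal sketch of this option, assuming the `myBiomodModelOut` object built in the Concretely section below (`em.algo = 'EMmean'` is only one possible choice among the available ensemble algorithms):
```R
# one ensemble model per pseudo-absence dataset x cross-validation run,
# each with its own validation (and, if provided, evaluation) values
myBiomodEM <- BIOMOD_EnsembleModeling(bm.mod = myBiomodModelOut,
                                      em.by = 'PA+run',
                                      em.algo = 'EMmean',
                                      metric.select = 'TSS',
                                      metric.eval = c('TSS', 'ROC'))
```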
### ✜ Concretely
_All the examples are made with the data of the package._
_For the beginning of the code, see the [main functions vignette](examples_1_mainFunctions.html)._
#### ⛒ Cross-validation
To run a random cross-validation with 2 repetitions, keeping 80% of the data for calibration and 20% for validation:
```R
myBiomodModelOut <- BIOMOD_Modeling(bm.format = myBiomodData,
                                    modeling.id = 'Example',
                                    models = c('RF', 'GLM'),
                                    CV.strategy = 'random',
                                    CV.nb.rep = 2,
                                    CV.perc = 0.8,
                                    metric.eval = c('TSS', 'ROC'))
```
To get the cross-validation table and visualize it on your [`BIOMOD.formated.data`](https://biomodhub.github.io/biomod2/reference/BIOMOD.formated.data.html) or [`BIOMOD.formated.data.PA`](https://biomodhub.github.io/biomod2/reference/BIOMOD.formated.data.PA.html) object:
```R
myCalibLines <- get_calib_lines(myBiomodModelOut)
plot(myBiomodData, calib.lines = myCalibLines)
```
To create a cross-validation table directly with [`bm_CrossValidation`](../reference/bm_CrossValidation.html):
```R
bm_CrossValidation(bm.format = myBiomodData,
                   strategy = "strat",
                   k = 2,
                   balance = "presences",
                   strat = "x")
```
Example of a table for the `user.defined` method: `myCVtable`.
| _PA1_RUN1 | _PA1_RUN2 | _PA2_RUN1 | _PA2_RUN2 |
| --------- | --------- | --------- | --------- |
| FALSE | FALSE | FALSE | TRUE |
| TRUE | TRUE | FALSE | FALSE |
| TRUE | TRUE | TRUE | TRUE |
| ... | ... | ... | ... |
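Such a table could be built by hand as a logical matrix with appropriately named columns. A minimal sketch with random assignments (for illustration only; `n` is a hypothetical number of rows that must match your formatted dataset, and `TRUE` marks the rows used for calibration):
```R
n <- 300  # hypothetical number of rows in myBiomodData
myCVtable <- matrix(sample(c(TRUE, FALSE), n * 4, replace = TRUE),
                    nrow = n,
                    dimnames = list(NULL, c("_PA1_RUN1", "_PA1_RUN2",
                                            "_PA2_RUN1", "_PA2_RUN2")))
```
The table is then passed to [`BIOMOD_Modeling`](../reference/BIOMOD_Modeling.html) through `CV.user.table`: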
```R
myBiomodModelOut <- BIOMOD_Modeling(bm.format = myBiomodData,
                                    modeling.id = 'Example',
                                    models = c('RF', 'GLM'),
                                    CV.strategy = 'user.defined',
                                    CV.user.table = myCVtable,
                                    metric.eval = c('TSS', 'ROC'))
```
_You can find more examples in the [Secondary functions vignette](examples_2_secundaryFunctions.html)._
#### ☒ Evaluation
To add an independent dataset for evaluation, you will need to provide the corresponding environmental variables (`myEvalExpl`) as a raster, a matrix or a data.frame.
_Case 1_: if your evaluation response (`myEvalResp`) is a raster:
```R
myBiomodData <- BIOMOD_FormatingData(resp.var = myResp,
                                     expl.var = myExpl,
                                     resp.xy = myRespXY,
                                     resp.name = myRespName,
                                     eval.resp.var = myEvalResp,
                                     eval.expl.var = myEvalExpl)
```
_Case 2_: if your evaluation response (`myEvalResp`) is a vector:
- you also need the coordinates of your response points: `myEvalCoord`
- if `myEvalExpl` is a data.frame or a matrix, make sure the points are in the same order.
```R
myBiomodData <- BIOMOD_FormatingData(resp.var = myResp,
                                     expl.var = myExpl,
                                     resp.xy = myRespXY,
                                     resp.name = myRespName,
                                     eval.resp.var = myEvalResp,
                                     eval.expl.var = myEvalExpl,
                                     eval.resp.xy = myEvalCoord)
```
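Once the models are fitted on such a formatted dataset, the *evaluation* values computed on the independent dataset are returned alongside the *calibration* and *validation* values. A minimal check, assuming a recent `biomod2` version in which `get_evaluations` returns one row per model, run and metric:
```R
# calibration, validation and evaluation scores side by side
myEvals <- get_evaluations(myBiomodModelOut)
head(myEvals)
```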
### ✖ References
- **Wenger, S.J.** and **Olden, J.D.** (**2012**), *Assessing transferability of ecological models: an underappreciated aspect of statistical validation.* Methods in Ecology and Evolution, 3: 260-267. https://doi.org/10.1111/j.2041-210X.2011.00170.x
- **Muscarella, R.**, Galante, P.J., Soley-Guardia, M., Boria, R.A., Kass, J.M., Uriarte, M. and Anderson, R.P. (**2014**), *ENMeval: An R package for conducting spatially independent evaluations and estimating optimal model complexity for Maxent ecological niche models.* Methods in Ecology and Evolution, 5: 1198-1205. https://doi.org/10.1111/2041-210X.12261