---
title: "2. Block cross-validation for species distribution modelling"
author: "Roozbeh Valavi, Jane Elith, José Lahoz-Monfort and Gurutzeta Guillera-Arroita"
date: "`r Sys.Date()`"
output: 
  rmarkdown::html_vignette:
    fig_caption: yes
vignette: >
  %\VignetteIndexEntry{2. Block cross-validation for species distribution modelling}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

## Introduction

Species distribution models (SDMs) are widely used tools for predicting the potential distribution of a species based on environmental variables. However, it is crucial to evaluate the performance of these models to ensure their accuracy and reliability. One commonly used method for evaluating the performance of SDMs is **block cross-validation** (read more in *Valavi et al. 2019* and the Tutorial 1). This approach allows for a more robust evaluation of the model as it accounts for spatial autocorrelation and other spatial dependencies (Roberts et al. 2017). This document illustrates how to utilize the `blockCV` package to evaluate the performance of SDMs using block cross-validation.

Two examples are provided: modelling using the `randomForest`, and `biomod2` packages.

Check new updates of `blockCV` in the tutorial 1- blockCV introduction: how to create block cross-validation folds.

Please cite `blockCV` by: *Valavi R, Elith J, Lahoz-Monfort JJ, Guillera-Arroita G. blockCV: An R package for generating spatially or environmentally separated folds for k-fold cross-validation of species distribution models. Methods Ecol Evol. 2019; 10:225–232.* [doi: 10.1111/2041-210X.13107](https://doi.org/10.1111/2041-210X.13107)

## Reading and plotting data
The `blockCV` package contains the raw format of the following data:

- Raster covariates of Australia (`.tif`)
- Simulated species data (`.csv`)

These data are used to illustrate how the package is used. The raster data include several bioclimatic variables for Australia. The species data include presence-absence records (binary) of a simulated species.

```{r echo=FALSE}
options(scipen = 10)
```


To load the package raster data use:

```{r, fig.height=5, fig.width=7.2, warning=FALSE, message=FALSE}
library(sf) # working with spatial vector data
library(terra) # working with spatial raster data
library(tmap) # plotting maps

# load raster data
# the pipe operator |> is available for R version 4.1 or higher
rasters <- system.file("extdata/au/", package = "blockCV") |>
  list.files(full.names = TRUE) |>
  terra::rast()

```

The presence-absence species data include `243` presence points and `257` absence points.

```{r, fig.height=4.5, fig.width=7.1}
# load species presence-asence data and convert to sf
points <- read.csv(system.file("extdata/", "species.csv", package = "blockCV"))
head(points)

```

The appropriate format of species data for the `blockCV` package is simple features (from the `sf` package). The data is provide in [GDA2020 / GA LCC](https://epsg.io/7845) coordinate reference system with `"EPSG:7845"` as defined by `crs = 7845`. We convert the `data.frame` to `sf` as follows:

```{r, fig.height=4.5, fig.width=7.1}
pa_data <- sf::st_as_sf(points, coords = c("x", "y"), crs = 7845)
```

Let's plot the species data using [`tmap`](https://cran.r-project.org/package=tmap)  package:

```{r, fig.height=4.5, fig.width=7.1}
tm_shape(rasters[[1]]) +
  tm_raster(legend.show = FALSE, n = 30, palette = gray.colors(10)) +
  tm_shape(pa_data) +
  tm_dots(col = "occ", style = "cat", size = 0.1)

```

# Generating block CV folds

Here, we generate two CV strategies, one k-fold CV using `cv_spatial` and one LOO CV using `cv_nndm`. See more options and configurations in the *Tutorial 1 - introduction to `blockCV`*.

```{r message=TRUE, warning=TRUE}
library(blockCV)

```


Creating spatial blocks: 

```{r, fig.keep='all', warning=FALSE, message=FALSE, fig.height=5, fig.width=7}
scv1 <- cv_spatial(
  x = pa_data,
  column = "occ", # the response column (binary or multi-class)
  r = rasters,
  k = 5, # number of folds
  size = 360000, # size of the blocks in metres
  selection = "random", # random blocks-to-fold
  iteration = 50, # find evenly dispersed folds
  progress = FALSE, # turn off progress bar
  biomod2 = TRUE, # also create folds for biomod2
  raster_colors = terrain.colors(10, rev = TRUE) # options from cv_plot for a better colour contrast
) 

```


Now, let's create LOO CV folds with nearest neighbour distance matching (NNDM; Milà et al. 2022) algorithm. To run `cv_nndm`, you need a measure of spatial autocorrelation present in your data. This can be done wither by 1) fitting a model and use it's residual to calculate spatial autocorrelation, or 2) use the autocorrelation of response variable for it (Milà et al, 2022; Roberts et al. 2017). Here, we calculate spatial autocorrelation range in the response using the `cv_spatial_autocor` function.

```{r}
range <- cv_spatial_autocor(
  x = pa_data, # species data
  column = "occ", # column storing presence-absence records (0s and 1s)
  plot = FALSE
)

range$range
```

So the range of spatial autocorrelation is roughly `360` kilometres.

```{r fig.height=5, fig.width=7}
scv2 <- cv_nndm(
  x = pa_data,
  column = "occ",
  r = rasters,
  size = 360000, # range of spatial autocorrelation
  num_sample = 10000, # number of samples of prediction points
  sampling = "regular", # sampling methods; it can be random as well
  min_train = 0.1, # minimum portion to keep in each train fold
  plot = TRUE
)
```

You can visualise the generated folds of both methods using `cv_plot` function. Here is three folds from the `cv_nndm` object:

```{r}
# see the number of folds in scv2 object
scv2$k

```


```{r fig.height=5, fig.width=8}
cv_plot(
  cv = scv2, # cv object
  x = pa_data, # species spatial data
  num_plots = c(1, 10, 100) # three of folds to plot
)
```


## Evaluating SDMs with block cross-validation: examples

In this section, we show how to use the folds generated by `blockCV` in the previous sections for the evaluation of SDMs constructed on the species data available in the package. The `blockCV` stores training and testing folds in three different formats. The common format for all three blocking strategies is a list of the indices of observations in each fold. For `cv_spatial` and `cv_cluster` (but not `cv_buffer` and `cv_nndm`), the folds are also stored in a matrix format suitable for the `biomod2` package and a vector of fold's number for each observation. This is equal to the number of observation in species spatial data. These three formats are stored in the cv objects as `folds_list`, `biomod_table` and `folds_ids` respectively.


### Using `blockCV` with Random Forest model

Folds generated by `cv_nndm` function are used here (a training and testing fold for each record) to show how to use folds from this function (the `cv_buffer` is also similar to this approach) for evaluation species distribution models.   

Note that with `cv_nndm` using presence-absence data (and any other type of data except for presence-background data when `presence_bg = TRUE` is used), there is only one point in each testing fold, and therefore AUC cannot be calculated for each fold separately. Instead, the value of each point is first predicted to the testing point (of each fold), and then a unique AUC is calculated for the full set of predictions.

```{r, eval=FALSE}
# loading the libraries
library(randomForest)
library(precrec)

# extract the raster values for the species points as a dataframe
model_data <- terra::extract(rasters, pa_data, df = TRUE, ID = FALSE)
# adding species column to the dataframe
model_data$occ <- as.factor(pa_data$occ)
head(model_data)

# extract the fold indices from buffering object 
# created in the previous section
# the folds_list works for all four blocking strategies
folds <- scv2$folds_list

# create a data.frame to store the prediction of each fold (record)
test_table <- pa_data
test_table$preds <- NA

for(k in seq_len(length(folds))){
  # extracting the training and testing indices
  # this way works with folds_list list (but not folds_ids)
  trainSet <- unlist(folds[[k]][1]) # training set indices; first element
  testSet <- unlist(folds[[k]][2]) # testing set indices; second element
  rf <- randomForest(occ ~ ., model_data[trainSet, ], ntree = 500) # model fitting on training set
  test_table$preds[testSet] <- predict(rf, model_data[testSet, ], type = "prob")[,2] # predict the test set
}

# calculate Area Under the ROC and PR curves and plot the result
precrec_obj <- evalmod(scores = test_table$preds, labels = test_table$occ)
auc(precrec_obj)

```


```{r echo=FALSE}
# to not run the model and reduce run time; result are calculated and loaded
read.csv("../man/figures/roc_rf.csv") 

```

Plot the curves:

```{r, eval=FALSE, fig.height=3.7, fig.width=7}
library(ggplot2)

autoplot(precrec_obj)

```

![](../man/figures/rocpr.jpeg)


### Using `blockCV` in `biomod2` package

Package [`biomod2`](https://CRAN.R-project.org/package=biomod2) (Thuiller et al., 2017) is a commonly used platform for modelling species distributions in an ensemble framework. In this example, we show how to use `blockCV` folds in `biomod2`. In this example, the folds generated by `cv_spatial` is used to evaluate three modelling methods implemented in `biomod2`. The `CV.user.table` can be generated by both `cv_spatial` and `cv_cluster` functions and it is stored as `biomod_table` in their output objects (note: in the older versions of the `biomod2` package, the argument `data.split.table` was used for external CV folds. This has now changed to `CV.user.table` that also requires `CV.strategy = "user.defined"` and new column names. See the example below).

```{r eval=FALSE}
# loading the library
library(biomod2)

# extract the raster values for the species points as a dataframe
raster_values <- terra::extract(rasters, pa_data, df = TRUE, ID = FALSE)

# 1. Formatting Data
biomod_data <- BIOMOD_FormatingData(resp.var = pa_data$occ,
                                    expl.var = raster_values,
                                    resp.xy = sf::st_coordinates(pa_data),
                                    resp.name = "occ",
                                    na.rm = TRUE)

# 2. Defining the folds for CV.user.table
# note that biomod_table should be used here not folds
# use generated folds from cv_spatial in previous section
spatial_cv_folds <- scv1$biomod_table
# the new update of the package biomod2 (v4.2-3 <) requires the names to be as below
colnames(spatial_cv_folds) <- paste0("_allData_RUN", 1:ncol(spatial_cv_folds))

# 3. Defining Models Options; using default options here. You can use your own settting here.
biomod_options <- bm_ModelingOptions(data.type = "binary", strategy = "bigboss")

# 4. Model fitting
biomod_model_out <- BIOMOD_Modeling(biomod_data,
                                    models = c('GLM','MARS','GBM'),
                                    CV.strategy = "user.defined",
                                    CV.user.table	= spatial_cv_folds,
                                    OPT.user = biomod_options,
                                    var.import = 0,
                                    metric.eval = c('ROC'),
                                    do.full.models = TRUE)

```

```{r, eval=FALSE}
# 5. Model evaluation
biomod_model_eval <- get_evaluations(biomod_model_out)
biomod_model_eval[c("run", "algo", "metric.eval", "calibration", "validation")]

```

```{r echo=FALSE}
# to not run the model and reduce run time; result are calculated and loaded
read.csv("../man/figures/evl_biomod.csv") 

```

The `validation` column shows the result of spatial cross-validation and each RUN is a CV fold.


## References:
- C. Milà, J. Mateu, E. Pebesma, and H. Meyer, Nearest Neighbour Distance Matching Leave-One-Out Cross-Validation for map validation, Methods in Ecology and Evolution (2022).

- Roberts, D.R., Bahn, V., Ciuti, S., Boyce, M.S., Elith, J., Guillera-Arroita, G., Hauenstein, S., Lahoz-Monfort, J.J., Schröder, B., Thuiller, W., others, 2017. Cross-validation strategies for data with temporal, spatial, hierarchical, or phylogenetic structure. Ecography. 40: 913-929.

- Thuiller W, Georges D, Guéguen M, Engler R, Breiner F, Lafourcade B, Patin R, Blancheteau H (2024). biomod2: Ensemble Platform for Species Distribution Modeling. R package version 4.2-5. https://CRAN.R-project.org/package=biomod2.

- Valavi R, Elith J, Lahoz-Monfort JJ, Guillera-Arroita G. **blockCV: An R package for generating spatially or environmentally separated folds for k-fold cross-validation of species distribution models**. *Methods Ecol Evol.* 2019; 10:225–232. [doi: 10.1111/2041-210X.13107](https://doi.org/10.1111/2041-210X.13107)