--- title: "intSDM: R package to produce species distribution models in a reproducible framework" output: rmarkdown::html_vignette date: "`r Sys.Date()`" bibliography: '`r system.file("References.bib", package="intSDM")`' biblio-style: authoryear vignette: > %\VignetteIndexEntry{Example creating a reproducible workflow for red listed vascular plants} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", eval = FALSE, fig.width=8, fig.height=5, warning = FALSE, message = FALSE) ``` # Introduction In support of our manuscript, we developed an R [@R] package to help construct integrated species distribution models (ISDMs) from disparate datasets in a simple and reproducible framework. This *R Markdown* document presents an illustration of the package by creating an ISDM for red-listed plant species obtained via the *Vascular Plant Field Notes* survey program in Norway, as well as citizen-science data obtained from *Global Biodiversity Information Facility* (*GBIF*). The first step in exploring this document is to download the package using the following script: ```{r load packages} library(intSDM) library(INLA) ``` *intSDM* has two main functions: `startWorkflow()` and `sdmWorkflow()`. The first of which is designed to setup and specify all the individual components of the workflow using different slot functions. The functions related to this object include: | Function name | Function use | |-----------------------|------------------------------------------------------------------------------------------------------------| | `.$plot()` | Plot data and other objects required for the model. | | `.$addStructured()` | Add data not available on GBIF. | | `.$addMesh()` | Create an *inla.mesh* object. | | `.$addGBIF()` | Add data from GBIF. | | `.$addArea()` | Specify sampling domain. | | `.$addCovariates()` | Add spatial covariates. | | `.$crossValidation()` | Specify the cross-validation method. | | `.$modelOptions()` | Add *R-INLA [@martins2013bayesian], inlabru [@bachl2019inlabru]* and *PointedSDMs* [@PointedSDMs] options. | | `.$specifySpatial()` | Add penalizing complexity priors to the spatial effects. | | `.$biasFields()` | Specify an additional spatial effect for a dataset. | | `.$workflowOutput()` | Specify the output of the workflow. | | `.$obtainMeta()` | Obtain metadata for the occurrence records. | `sdmWorkflow()` implements the workflow based on the objects added in `startWorkflow()`. The output of this function is a list of objects specified in `.$workflowOutput()`. ## Workflow setup To start the workflow, we need to specify the coordinate reference system (CRS) considered for the analysis as well as the species used. The three species selected for this analysis from the *Vascular Plant Field Notes* (*arnica montana*, *fraxinus excelsior* and *ulmus glabra*) have records predominantly spread across the southern and eastern part of Norway. However the species *ulmus glabra* (which has the largest spread of the three species selected), has some of the records approaching the middle and middle-upper parts of Norway. The other arguments (*saveOptions* and *Save)* should be used if the user wants the objects to be saved in a folder created by the function. ```{r initialize workflow} workflow <- startWorkflow( Projection = '+proj=utm +zone=32 +ellps=WGS84 +datum=WGS84 +units=km +no_defs', Species = c("Fraxinus_excelsior", "Ulmus_glabra", "Arnica_montana"), saveOptions = list(projectName = 'Vascular'), Save = FALSE ) ``` ### Select area Next we specify the study domain for the study: in this case Norway. This can be achieved using either the *countryName* argument which will then access the object from the *giscoR* [@giscoR] R package, or by using *Object* and supplying our own polygon object. In this case we choose to add our own polygon object because the rigid coastline of a unedited Norway polygon will cause issues further down. We then use `.$plot()` to see what the boundary looks like. ```{r addArea} yes <- FALSE if (yes) { Norway <- giscoR::gisco_get_countries(country = 'Norway', resolution = 60) Norway <- st_cast(st_as_sf(Norway), 'POLYGON') Norway <- Norway[which.max(st_area(Norway)),] Norway <- rmapshaper::ms_simplify(Norway, keep = 0.8) workflow$addArea(Object = Norway) workflow$plot() } Norway <- readRDS('IntegratedLakefish/Norway.rds') workflow$addArea(Object = Norway) ``` ### Adding occurrence data Species' occurrence data is certainly the most important component of a SDM, and *intSDM* has two slot functions to help you add data into the workflow: `.$addGBIF()` and `.$addStructured()`. The former of which uses the *rgbif* package to download data directly from GBIF. For this function, we need to specify the name of the dataset (*datasetName*) and the type of the dataset (*datasetType*) -- which can be one of *PO, PA* or *Counts*. The *...* argument is used to specify any addiditional arguments for `rgbif::occ_data()` [@chamberlain2022rgbif] (in this case, *limit* and *datasetKey*). If *`datasetType = 'PA'`*, absences may be generated using *`generateAbsences = TRUE`*. This will treat the obtained data as a checklist survey data: combining all the sampling locations for the species in the dataset, and creating absences when a given species did not occur in a given region. For this example we consider three sources of data. The *Vascular Plant Field Notes* is a collection of observations provided by the *Norwegian University of Science and Technolog*y's (NTNU) [@NTNU] University Museum and the *University of Oslo* (UiO) [@UiO], containing records of standardized cross-lists of most vascular plants found in Norway. We treat these two datasets as detection/non-detection data, generating absences in sampling locations where the species does not occur. The other source of data considered comes from the Norwegian Species Observation service (published by *Artsdatabanken*) [@Ardatasbanken]. This data is a collection of citizen science records -- and as a result we treat it as presence-only data. ```{r addGBIF} workflow$addGBIF(datasetName = 'NTNU', datasetType = 'PA', limit = 10000, coordinateUncertaintyInMeters = '0,50', generateAbsences = TRUE, datasetKey = 'd29d79fd-2dc4-4ef5-89b8-cdf66994de0d') workflow$addGBIF(datasetName = 'UiO', datasetType = 'PA', limit = 10000, coordinateUncertaintyInMeters = '0,50', generateAbsences = TRUE, datasetKey = 'e45c7d91-81c6-4455-86e3-2965a5739b1f') workflow$addGBIF(datasetName = 'CZ', datasetType = 'PO', coordinateUncertaintyInMeters = '0,50', limit = 10000, datasetKey = 'b124e1e0-4755-430f-9eab-894f25a9b59c') workflow$plot(Species = TRUE) ``` ### Adding covariate data Covariate data may be added to the model using `.$addCovariates()`. Layers from *WorldClim [@fick2017worldclim]* may be accessed using the *worldClim argument*. This in turn uses the *geodata [@geodata]* R package to obtain *spatRaster* objects of the covariates cropped around the study domain. Other covariate layers may be added using the *Object* argument. ```{r addCovariates, eval = FALSE} workflow$addCovariates(worldClim = 'tavg', res = 5, Function = scale) workflow$plot(Covariates = TRUE) ``` ### Metadata We can then view the metadata for the obtained occurrence records using the `.$obtainMeta()` function, which will give us the citation for the datasets used in this workflow. ```{r metadata} workflow$obtainMeta() ``` ### Creating an *inla.mesh* object One of the objects required for our model is an *inla.mesh* object, which we will use in the approximation of our spatial random fields. The `.$addMesh()` function's argument *...* uses `INLA::inla.mesh.2d()` to create this object. ```{r INLA} workflow$addMesh(cutoff = 20 * 0.25, max.edge = c(60, 80)*0.5, #0.25 offset= c(30, 40)) workflow$plot(Mesh = TRUE) ``` ### Specify priors Furthermore we also used *penalizing complexity* (PC) priors in our model, which are designed to control the spatial range and standard deviation in the GRF's Matérn covariance function in order to reduce over-fitting in the model [@simpson2017penalising]. ```{r Priors} workflow$specifySpatial(prior.range = c(100, 0.1), prior.sigma = c(1, 0.1), constr = FALSE) ``` We also specify priors for the intercept terms and the fixed effects of the models. In this vignette we choose tight priors (mean = 0; precision = 1). ```{r Fixed priors} workflow$specifyPriors(effectNames = 'Intercept', Mean = 0, Precision = 1) workflow$specifyPriors('tavg', Mean = 0, Precision = 1) ``` ### Bias field We specify an additional spatial effect for the citizen science data using `.$biasFields()` to account biases in the collection process [@simmonds2020more]. ```{r Bias} workflow$biasFields('CZ', prior.range = c(100, 0.1), #1 #0.1 prior.sigma = c(1, 0.1), #1 #0.1 constr = FALSE) ``` ### Model options We specify the model output using the function `.$workflowOutput`. In this workflow, we want to return the *R-INLA* model objects and the maps of the predicted intensity. ```{r options} #workflow$crossValidation(Method = 'Loo') workflow$workflowOutput(c('Maps', 'Model', 'Bias')) ``` ## Running workflow The workflow is then implemented using the `sdmWorkflow()` function. We also specify some *R-INLA* options to speed and stabilize the estimates of the model using `inlaOptions`. Due to the lengthy time it requires to produce this map, inference is not made in this vignette. However the script is available below for the user to run the model themselves. ```{r Maps} Maps <- sdmWorkflow(workflow,inlaOptions = list(control.inla=list(int.strategy = 'ccd', strategy = 'gaussian', cmin = 0, diagonal = 0.1, control.vb=list(enable = FALSE)), safe = TRUE, verbose = TRUE, inla.mode = 'experimental'), predictionDim = c(400, 400)) ``` ## Maps Maps of the predicted intensity are given as follows. ```{r MapsOut} Maps$Fraxinus_excelsior$Maps Maps$Ulmus_glabra$Maps Maps$Arnica_montana$Maps saveRDS(Maps, 'Maps.rds') ```