--- title: "Species Distribution Response Curves" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Species Distribution Response Curves} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 10.5, fig.height = 6.5, out.width = "100%", fig.align = "center" ) species_example <- requireNamespace("disdat", quietly = TRUE) && requireNamespace("randomForest", quietly = TRUE) ensemble_example <- species_example && requireNamespace("mgcv", quietly = TRUE) ``` This vignette shows a compact species distribution workflow using [`disdat`](https://cran.r-project.org/package=disdat) data, a down-sampled random forest classifier, and model-agnostic response plots from `curves`. The example uses one species from New South Wales (`"nsw14"`), combines its presence records with 10,000 random background points, and then down-samples the background class during model fitting so each tree sees a balanced sample. The code below is shown in full. It is evaluated when the optional packages `disdat` and `randomForest` are installed; the short ensemble example also requires `mgcv`. ## Prepare the training data ```{r, eval = species_example, message = FALSE, warning = FALSE} spID <- "nsw14" pr <- disdat::disPo("NSW") bg <- disdat::disBg("NSW") pr <- pr[pr$spid == spID, ] training <- rbind(pr, bg) covars <- c( "disturb", "mi", "rainann", "raindq", "rugged", "soildepth", "tempann", "topo", "vegsys" ) training <- training[, c("occ", covars)] training$vegsys <- as.factor(training$vegsys) numeric_covars <- setdiff(covars, "vegsys") training[numeric_covars] <- lapply( training[numeric_covars], function(x) as.numeric(scale(x)) ) head(training[, 1:5]) ``` The numeric covariates are centred and scaled to keep the example plots on comparable axes. This is only for readability; `curves` does not require standardised predictors. 
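The standardisation step can be checked on a toy vector: `scale()` returns a one-column matrix with `center`/`scale` attributes, which is why the code above wraps it in `as.numeric()`. The vector here is illustrative, not from the training data:

```r
# scale() centres to mean 0 and scales to sd 1, returning a matrix
x <- c(10, 20, 30, 40, 50)
z_vec <- as.numeric(scale(x))    # drop the matrix wrapper for a plain vector

# equivalent by hand: subtract the mean, divide by the sample sd
z_manual <- (x - mean(x)) / sd(x)
all.equal(z_vec, z_manual)
```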
## Fit a down-sampled random forest

`randomForest::randomForest()` can balance each bootstrap sample with `sampsize`. Here the number of background samples drawn for each tree is set to the number of presences, which is a common way to reduce class imbalance. The number of trees is deliberately modest so the vignette remains quick to build.

```{r, eval = species_example, message = FALSE, warning = FALSE}
# get the number of presences (before converting occ to a factor)
pr_num <- sum(training$occ)
smpsize <- c("0" = pr_num, "1" = pr_num)

training$occ <- as.factor(training$occ)

set.seed(20260501)
rf_model <- randomForest::randomForest(
  occ ~ .,
  data = training,
  ntree = 500,
  sampsize = smpsize,
  replace = TRUE
)
```

## Plot univariate and bivariate response curves

For binary classification models, `predict(..., type = "prob")` returns one column per class. In this example the presence class is `"1"`, so the response plots use `response = "1"` to make that choice explicit. Because the model is fit with presences and background samples rather than confirmed absences, the plotted values are interpreted as relative likelihoods, not occurrence probabilities.

```{r, eval = species_example, fig.height = 8, message = FALSE, warning = FALSE}
predictors <- training[, covars]

curves::univariate(
  rf_model,
  predictors,
  method = "profile",
  type = "prob",
  response = "1",
  ylab = "Relative likelihood"
)
```

The same interface can be used for partial dependence, ICE, and ALE plots. `method = "pdp"` averages predictions over sampled rows from `dat`, whereas `method = "ice+pdp"` overlays that average on top of the sampled ICE curves. Here `n = 100` controls the grid resolution for numeric predictors, while `background_n` controls how many rows are sampled for PDP/ICE. For `method = "pdp"`, `interval = "quantile"` can be used to show a central pointwise band around the averaged curve.
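Conceptually, the PDP computation fixes the focal predictor at each grid value, substitutes that value into the sampled background rows, and averages the model's predictions. A minimal base-R sketch of that loop, using a toy linear model rather than the random forest (all objects here are illustrative; `curves` does this internally):

```r
# toy data and model: the partial dependence of x1 while x2 varies
set.seed(1)
dat <- data.frame(x1 = runif(200), x2 = runif(200))
dat$y <- 2 * dat$x1 + dat$x2 + rnorm(200, sd = 0.1)
fit <- lm(y ~ x1 + x2, data = dat)

# fix x1 at each grid value and average predictions over the background rows
grid <- seq(0, 1, length.out = 25)
pdp <- vapply(grid, function(g) {
  newdat <- dat
  newdat$x1 <- g
  mean(predict(fit, newdat))
}, numeric(1))

plot(grid, pdp, type = "l", xlab = "x1", ylab = "Averaged prediction")
```

For a linear model the averaged curve is a straight line with the fitted coefficient of `x1` as its slope, which makes the sketch easy to verify by eye.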
```{r, eval = species_example, fig.height = 8, message = FALSE, warning = FALSE} set.seed(20260501) curves::univariate( rf_model, predictors, method = "pdp", n = 100, background_n = 200, interval = "quantile", interval_level = 0.8, type = "prob", response = "1", ylab = "Relative likelihood" ) ``` ```{r, eval = species_example, fig.height = 8, message = FALSE, warning = FALSE} set.seed(20260502) curves::univariate( rf_model, predictors, method = "ice+pdp", n = 100, background_n = 200, type = "prob", response = "1", ylab = "Relative likelihood" ) ``` ALE curves accumulate local finite differences within the observed predictor distribution, which often makes them less sensitive than PDPs to strongly correlated predictors. In `curves`, univariate ALE supports both numeric and factor predictors; bivariate ALE surfaces currently use numeric predictor pairs. ```{r, eval = species_example, fig.height = 8, message = FALSE, warning = FALSE} curves::univariate( rf_model, predictors, method = "ale", n = 40, type = "prob", response = "1", ylab = "Relative likelihood" ) ``` The bivariate examples use `mi` and `rainann`, two numeric covariates in the training data. ```{r, eval = species_example, fig.width = 5.6, fig.height = 4, message = FALSE, warning = FALSE} set.seed(20260503) curves::bivariate( rf_model, predictors, pairs = c(2, 3), method = "pdp", background_n = 200, rug = TRUE, plot_type = "heatmap", type = "prob", response = "1", zlab = "Relative likelihood" ) ``` For correlated covariates, a centred bivariate ALE surface is often more stable than a PDP surface because it accumulates local differences only where the data are observed. 
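The accumulation idea behind ALE can be sketched for one numeric predictor: split its range into quantile bins, compute the model's prediction difference across each bin using only the rows observed in that bin, accumulate those local effects, and centre the curve. The toy data and linear `fit` below are illustrative only, and the simple mean-centring here is a simplification of the count-weighted centring ALE implementations typically use:

```r
set.seed(2)
dat <- data.frame(x1 = runif(300), x2 = runif(300))
dat$y <- 3 * dat$x1 + dat$x2
fit <- lm(y ~ x1 + x2, data = dat)

# bin edges from the empirical quantiles of x1
k <- 10
edges <- quantile(dat$x1, probs = seq(0, 1, length.out = k + 1))
bin <- cut(dat$x1, edges, include.lowest = TRUE, labels = FALSE)

# local effect per bin: mean prediction difference across the bin edges,
# evaluated only at the rows that fall in that bin
local <- vapply(seq_len(k), function(j) {
  rows <- dat[bin == j, ]
  lo <- rows; lo$x1 <- edges[j]
  hi <- rows; hi$x1 <- edges[j + 1]
  mean(predict(fit, hi) - predict(fit, lo))
}, numeric(1))

ale <- c(0, cumsum(local))   # accumulate the local effects
ale <- ale - mean(ale)       # centre (a simplification; see note above)

plot(edges, ale, type = "l", xlab = "x1", ylab = "ALE")
```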
```{r, eval = species_example, fig.width = 5.6, fig.height = 4, message = FALSE, warning = FALSE} curves::bivariate( rf_model, predictors, pairs = c(2, 3), method = "ale", n = 40, plot_type = "heatmap", type = "prob", response = "1", zlab = "ALE" ) ``` ## Ensemble response curves Species distribution modelling often uses ensembles of models fit to the same response and predictors. `multimodel()` computes the chosen response-curve method for each ensemble member, then combines those curves pointwise. This short example fits a random forest and a GAM to the same two predictors and puts both models on the predicted relative likelihood scale before averaging. ```{r, eval = ensemble_example, fig.width = 7, fig.height = 4.5, message = FALSE, warning = FALSE} ensemble_covars <- c("mi", "rainann") ensemble_training <- training[, c("occ", ensemble_covars)] set.seed(20260504) rf_member <- randomForest::randomForest( occ ~ ., data = ensemble_training, ntree = 300, sampsize = smpsize, replace = TRUE ) gam_training <- transform( ensemble_training, occ_num = as.integer(as.character(occ)) ) gam_member <- mgcv::gam( occ_num ~ s(mi, k = 6) + s(rainann, k = 6), data = gam_training, family = binomial(), method = "REML" ) rf_likelihood <- function(model, newdata) { predict(model, newdata, type = "prob")[, "1"] } gam_likelihood <- function(model, newdata) { as.numeric(stats::predict(model, newdata, type = "response")) } set.seed(20260505) curves::multimodel( list(rf_member, gam_member), ensemble_training[, ensemble_covars], fun = list(rf_likelihood, gam_likelihood), method = "pdp", n = 60, background_n = 200, show_models = TRUE, ylab = "Relative likelihood" ) ``` For a 3D surface, a small wrapper around `predict()` is still convenient when you want direct control over the returned prediction vector. 
```{r, eval = species_example && requireNamespace("plotly", quietly = TRUE), message = FALSE, warning = FALSE} rf_likelihood <- function(model, newdata) { predict(model, newdata, type = "prob")[, "1"] } curves::bivariate( rf_model, predictors, pairs = c(2, 3), plot_type = "surface", fun = rf_likelihood, zlab = "Relative likelihood" ) ```
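The pointwise combination that `multimodel()` performs for the ensemble above can also be sketched in base R: evaluate each member's response curve on a shared grid, then average the curves position by position. The two toy members below stand in for the random forest and the GAM:

```r
set.seed(3)
dat <- data.frame(x = runif(200))
dat$y <- plogis(4 * dat$x - 2) + rnorm(200, sd = 0.05)

# two simple "members" fit to the same response and predictor
m_lin  <- lm(y ~ x, data = dat)
m_poly <- lm(y ~ poly(x, 3), data = dat)

# evaluate both response curves on a shared grid, then average pointwise
grid <- data.frame(x = seq(0, 1, length.out = 50))
curve_lin  <- predict(m_lin, grid)
curve_poly <- predict(m_poly, grid)
ensemble   <- (curve_lin + curve_poly) / 2

matplot(grid$x, cbind(curve_lin, curve_poly, ensemble), type = "l",
        lty = c(2, 2, 1), col = c("grey50", "grey50", "black"),
        xlab = "x", ylab = "Prediction")
```

By construction the averaged curve lies between the member curves at every grid point, which is also why `show_models = TRUE` is useful for judging how much the members disagree.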