Species Distribution Response Curves

This vignette shows a compact species distribution workflow using disdat data, a down-sampled random forest classifier, and model-agnostic response plots from the curves package.

The example uses one species from New South Wales ("nsw14"), combines its presence records with 10,000 random background points, and then down-samples the background class during model fitting so each tree sees a balanced sample.

The code below is shown in full. It is evaluated when the optional packages disdat and randomForest are installed; the short ensemble example also requires mgcv.

Prepare the training data

spID <- "nsw14"
pr <- disdat::disPo("NSW")
bg <- disdat::disBg("NSW")

pr <- pr[pr$spid == spID, ]
training <- rbind(pr, bg)

covars <- c(
  "disturb", "mi", "rainann", "raindq",
  "rugged", "soildepth",
  "tempann", "topo", "vegsys"
)

training <- training[, c("occ", covars)]
training$vegsys <- as.factor(training$vegsys)

numeric_covars <- setdiff(covars, "vegsys")
training[numeric_covars] <- lapply(
  training[numeric_covars],
  function(x) as.numeric(scale(x))
)

head(training[, 1:5])
#>      occ    disturb        mi  rainann    raindq
#> 1150   1 -0.1104779 0.9155303 1.268810 0.7474699
#> 1151   1 -0.1104779 0.9155303 1.274938 0.7474699
#> 1152   1 -0.1104779 1.0000868 1.281066 0.8759767
#> 1153   1 -1.1455181 1.2537564 1.238172 0.8759767
#> 1154   1 -1.1455181 1.1691998 1.201405 0.8759767
#> 1155   1 -0.1104779 1.0846433 1.048212 0.7474699

The numeric covariates are centred and scaled to keep the example plots on comparable axes. This is only for readability; curves does not require standardised predictors.
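
As a quick aside, the effect of scale() can be checked on a toy vector (not part of the workflow data): it centres to mean zero and rescales to unit standard deviation.

```r
# toy numeric vector, standardised the same way as the covariates above
x <- c(2, 4, 6, 8, 10)
z <- as.numeric(scale(x))

round(mean(z), 10)  # 0
round(sd(z), 10)    # 1
```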

Fit a down-sampled random forest

randomForest::randomForest() can balance each bootstrap sample via its sampsize argument. Here the number of background points drawn for each tree is set to the number of presences, a common way to reduce class imbalance. The number of trees is kept modest so the vignette remains quick to build.

# number of presences (computed before occ is converted to a factor)
pr_num <- sum(training$occ)
smpsize <- c("0" = pr_num, "1" = pr_num)

training$occ <- as.factor(training$occ)

set.seed(20260501)
rf_model <- randomForest::randomForest(
  occ ~ .,
  data = training,
  ntree = 500,
  sampsize = smpsize,
  replace = TRUE
)

Plot univariate and bivariate response curves

For binary classification models, predict(..., type = "prob") returns one column per class. In this example the presence class is "1", so the response plots use response = "1" to make that choice explicit. Because the model is fit with presences and background samples rather than confirmed absences, the plotted values are interpreted as relative likelihoods, not occurrence probabilities.

predictors <- training[, covars]

curves::univariate(
  rf_model,
  predictors,
  method = "profile",
  type = "prob",
  response = "1",
  ylab = "Relative likelihood"
)

The same interface can be used for partial dependence, ICE, and ALE plots. method = "pdp" averages predictions over rows sampled from the supplied predictor data, whereas method = "ice+pdp" overlays that average on the sampled ICE curves. Here n = 100 controls the grid resolution for numeric predictors, while background_n controls how many rows are sampled for PDP/ICE. For method = "pdp", interval = "quantile" adds a central pointwise band around the averaged curve.

set.seed(20260501)
curves::univariate(
  rf_model,
  predictors,
  method = "pdp",
  n = 100,
  background_n = 200,
  interval = "quantile",
  interval_level = 0.8,
  type = "prob",
  response = "1",
  ylab = "Relative likelihood"
)

set.seed(20260502)
curves::univariate(
  rf_model,
  predictors,
  method = "ice+pdp",
  n = 100,
  background_n = 200,
  type = "prob",
  response = "1",
  ylab = "Relative likelihood"
)
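
What method = "pdp" computes can be sketched by hand. The snippet below is a toy illustration with a simple linear model on simulated data (independent of curves and the species data): it substitutes each grid value into sampled background rows and averages the predictions.

```r
set.seed(1)
dat <- data.frame(x1 = runif(200), x2 = runif(200))
dat$y <- 2 * dat$x1 + dat$x2 + rnorm(200, sd = 0.1)
fit <- lm(y ~ x1 + x2, data = dat)

# grid over the predictor of interest and a background sample of rows
grid <- seq(min(dat$x1), max(dat$x1), length.out = 50)
bg <- dat[sample(nrow(dat), 100), ]

# partial dependence: average prediction with x1 forced to each grid value
pdp <- vapply(grid, function(g) {
  bg$x1 <- g
  mean(predict(fit, newdata = bg))
}, numeric(1))

plot(grid, pdp, type = "l", xlab = "x1", ylab = "Mean prediction")
```

For a linear model the resulting curve is a straight line whose slope matches the x1 coefficient, which makes the averaging step easy to verify.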

ALE curves accumulate local finite differences within the observed predictor distribution, which often makes them less sensitive than PDPs to strongly correlated predictors. In curves, univariate ALE supports both numeric and factor predictors; bivariate ALE surfaces currently use numeric predictor pairs.

curves::univariate(
  rf_model,
  predictors,
  method = "ale",
  n = 40,
  type = "prob",
  response = "1",
  ylab = "Relative likelihood"
)
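
The accumulation behind ALE can likewise be sketched in a few lines. This toy example (again a linear model on simulated data, not the curves implementation) bins one predictor by quantiles, averages the prediction difference across each bin's edges for the rows falling in that bin, then accumulates and centres the result; real implementations typically weight the centring by bin counts.

```r
set.seed(2)
dat <- data.frame(x1 = runif(300), x2 = runif(300))
dat$y <- 3 * dat$x1 + dat$x2 + rnorm(300, sd = 0.1)
fit <- lm(y ~ x1 + x2, data = dat)

# interval edges from the empirical distribution of x1
k <- 10
edges <- quantile(dat$x1, probs = seq(0, 1, length.out = k + 1))
bin <- cut(dat$x1, edges, include.lowest = TRUE, labels = FALSE)

# local effect per interval: mean prediction difference between its edges
local <- vapply(seq_len(k), function(j) {
  rows <- dat[bin == j, ]
  lo <- hi <- rows
  lo$x1 <- edges[j]
  hi$x1 <- edges[j + 1]
  mean(predict(fit, hi) - predict(fit, lo))
}, numeric(1))

# accumulate the local effects and centre the curve around zero
ale <- c(0, cumsum(local))
ale <- ale - mean(ale)
```

Because finite differences are only taken within observed bins, points far outside the joint distribution of correlated predictors never contribute, which is the stability property mentioned above.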

The bivariate examples use mi and rainann, two numeric covariates in the training data.

set.seed(20260503)
curves::bivariate(
  rf_model,
  predictors,
  pairs = c(2, 3),
  method = "pdp",
  background_n = 200,
  rug = TRUE,
  plot_type = "heatmap",
  type = "prob",
  response = "1",
  zlab = "Relative likelihood"
)

For correlated covariates, a centred bivariate ALE surface is often more stable than a PDP surface because it accumulates local differences only where the data are observed.

curves::bivariate(
  rf_model,
  predictors,
  pairs = c(2, 3),
  method = "ale",
  n = 40,
  plot_type = "heatmap",
  type = "prob",
  response = "1",
  zlab = "ALE"
)

Ensemble response curves

Species distribution modelling often uses ensembles of models fit to the same response and predictors. multimodel() computes the chosen response-curve method for each ensemble member, then combines those curves pointwise. This short example fits a random forest and a GAM to the same two predictors and puts both models on the predicted relative likelihood scale before averaging.

ensemble_covars <- c("mi", "rainann")
ensemble_training <- training[, c("occ", ensemble_covars)]

set.seed(20260504)
rf_member <- randomForest::randomForest(
  occ ~ .,
  data = ensemble_training,
  ntree = 300,
  sampsize = smpsize,
  replace = TRUE
)

gam_training <- transform(
  ensemble_training,
  occ_num = as.integer(as.character(occ))
)

gam_member <- mgcv::gam(
  occ_num ~ s(mi, k = 6) + s(rainann, k = 6),
  data = gam_training,
  family = binomial(),
  method = "REML"
)


rf_likelihood <- function(model, newdata) {
  predict(model, newdata, type = "prob")[, "1"]
}

gam_likelihood <- function(model, newdata) {
  as.numeric(stats::predict(model, newdata, type = "response"))
}

set.seed(20260505)
curves::multimodel(
  list(rf_member, gam_member),
  ensemble_training[, ensemble_covars],
  fun = list(rf_likelihood, gam_likelihood),
  method = "pdp",
  n = 60,
  background_n = 200,
  show_models = TRUE,
  ylab = "Relative likelihood"
)
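
The pointwise combination that multimodel() performs can be illustrated without the package: evaluate each member on a shared grid, then average the curves at each grid value. The toy models below stand in for the ensemble members.

```r
set.seed(3)
dat <- data.frame(x = runif(200))
dat$y <- plogis(4 * dat$x - 2)

# two toy "members" fit to the same response and predictor
m1 <- lm(y ~ x, data = dat)
m2 <- glm(y ~ x, data = dat, family = quasibinomial())

# shared grid, one prediction curve per member
grid <- data.frame(x = seq(0, 1, length.out = 50))
p1 <- predict(m1, grid)
p2 <- predict(m2, grid, type = "response")

# pointwise ensemble curve: mean of the member curves at each grid value
ensemble <- (p1 + p2) / 2
```

With show_models = TRUE the individual member curves (here p1 and p2) are drawn alongside the combined curve, which helps diagnose members that disagree.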

For a 3D surface, a small wrapper around predict() is still convenient when you want direct control over the returned prediction vector.

# same helper as in the ensemble example, repeated so this chunk is self-contained
rf_likelihood <- function(model, newdata) {
  predict(model, newdata, type = "prob")[, "1"]
}

curves::bivariate(
  rf_model,
  predictors,
  pairs = c(2, 3),
  plot_type = "surface",
  fun = rf_likelihood,
  zlab = "Relative likelihood"
)