This vignette shows a compact species distribution workflow using disdat
data, a down-sampled random forest classifier, and model-agnostic
response plots from curves.
The example uses one species from New South Wales
("nsw14"), combines its presence records with 10,000 random
background points, and then down-samples the background class during
model fitting so each tree sees a balanced sample.
The code below is shown in full. It is evaluated when the optional
packages disdat and randomForest are
installed; the short ensemble example also requires
mgcv.
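One way to implement such a guard (a minimal sketch assuming a standard knitr setup, not necessarily the literal chunk options used to build this vignette) is to set a global eval flag:
# Evaluate the worked example only when the optional packages are installed.
has_deps <- requireNamespace("disdat", quietly = TRUE) &&
  requireNamespace("randomForest", quietly = TRUE)
knitr::opts_chunk$set(eval = has_deps)
The data preparation itself is then straightforward: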
spID <- "nsw14"
pr <- disdat::disPo("NSW")
bg <- disdat::disBg("NSW")
pr <- pr[pr$spid == spID, ]
training <- rbind(pr, bg)
covars <- c(
"disturb", "mi", "rainann", "raindq",
"rugged", "soildepth",
"tempann", "topo", "vegsys"
)
training <- training[, c("occ", covars)]
training$vegsys <- as.factor(training$vegsys)
numeric_covars <- setdiff(covars, "vegsys")
training[numeric_covars] <- lapply(
training[numeric_covars],
function(x) as.numeric(scale(x))
)
head(training[, 1:5])
#> occ disturb mi rainann raindq
#> 1150 1 -0.1104779 0.9155303 1.268810 0.7474699
#> 1151 1 -0.1104779 0.9155303 1.274938 0.7474699
#> 1152 1 -0.1104779 1.0000868 1.281066 0.8759767
#> 1153 1 -1.1455181 1.2537564 1.238172 0.8759767
#> 1154 1 -1.1455181 1.1691998 1.201405 0.8759767
#> 1155 1 -0.1104779 1.0846433 1.048212 0.7474699
The numeric covariates are centred and scaled to keep the example
plots on comparable axes. This is only for readability;
curves does not require standardised predictors.
randomForest::randomForest() can balance each bootstrap
sample with sampsize. Here the number of background samples
drawn for each tree is set to the number of presences, which is a common
way to reduce class imbalance. The number of trees is deliberately
modest so the vignette remains quick to build.
# get the number of presences (before converting occ to a factor)
pr_num <- sum(training$occ)
smpsize <- c("0" = pr_num, "1" = pr_num)
training$occ <- as.factor(training$occ)
set.seed(20260501)
rf_model <- randomForest::randomForest(
occ ~ .,
data = training,
ntree = 500,
sampsize = smpsize,
replace = TRUE
)
For binary classification models,
predict(..., type = "prob") returns one column per class.
In this example the presence class is "1", so the response
plots use response = "1" to make that choice explicit.
Because the model is fit with presences and background samples rather
than confirmed absences, the plotted values are interpreted as relative
likelihoods, not occurrence probabilities.
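As a quick check (not part of the modelling code above), the class probability matrix can be inspected directly; it has one column per level of occ, here "0" (background) and "1" (presence).
# Inspect the class probability matrix returned by the forest.
prob <- predict(rf_model, newdata = training, type = "prob")
colnames(prob)
head(prob)
With the presence column identified, the profile plots are drawn from the predictor columns only: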
predictors <- training[, covars]
curves::univariate(
rf_model,
predictors,
method = "profile",
type = "prob",
response = "1",
ylab = "Relative likelihood"
)
The same interface can be used for partial dependence, ICE, and ALE
plots. method = "pdp" averages predictions over sampled
rows from dat, whereas method = "ice+pdp"
overlays that average on top of the sampled ICE curves. Here
n = 100 controls the grid resolution for numeric
predictors, while background_n controls how many rows are
sampled for PDP/ICE. For method = "pdp",
interval = "quantile" can be used to show a central
pointwise band around the averaged curve.
set.seed(20260501)
curves::univariate(
rf_model,
predictors,
method = "pdp",
n = 100,
background_n = 200,
interval = "quantile",
interval_level = 0.8,
type = "prob",
response = "1",
ylab = "Relative likelihood"
)
set.seed(20260502)
curves::univariate(
rf_model,
predictors,
method = "ice+pdp",
n = 100,
background_n = 200,
type = "prob",
response = "1",
ylab = "Relative likelihood"
)
ALE curves accumulate local finite differences within the observed
predictor distribution, which often makes them less sensitive than PDPs
to strongly correlated predictors. In curves, univariate
ALE supports both numeric and factor predictors; bivariate ALE surfaces
currently use numeric predictor pairs.
curves::univariate(
rf_model,
predictors,
method = "ale",
n = 40,
type = "prob",
response = "1",
ylab = "Relative likelihood"
)
The bivariate examples use mi and rainann,
two numeric covariates in the training data.
set.seed(20260503)
curves::bivariate(
rf_model,
predictors,
pairs = c(2, 3),
method = "pdp",
background_n = 200,
rug = TRUE,
plot_type = "heatmap",
type = "prob",
response = "1",
zlab = "Relative likelihood"
)
For correlated covariates, a centred bivariate ALE surface is often
more stable than a PDP surface because it accumulates local
differences only where the data are observed.
curves::bivariate(
rf_model,
predictors,
pairs = c(2, 3),
method = "ale",
n = 40,
plot_type = "heatmap",
type = "prob",
response = "1",
zlab = "ALE"
)
Species distribution modelling often uses ensembles of models fit to
the same response and predictors. multimodel() computes the
chosen response-curve method for each ensemble member, then combines
those curves pointwise. This short example fits a random forest and a
GAM to the same two predictors and puts both models on the predicted
relative likelihood scale before averaging.
ensemble_covars <- c("mi", "rainann")
ensemble_training <- training[, c("occ", ensemble_covars)]
set.seed(20260504)
rf_member <- randomForest::randomForest(
occ ~ .,
data = ensemble_training,
ntree = 300,
sampsize = smpsize,
replace = TRUE
)
gam_training <- transform(
ensemble_training,
occ_num = as.integer(as.character(occ))
)
gam_member <- mgcv::gam(
occ_num ~ s(mi, k = 6) + s(rainann, k = 6),
data = gam_training,
family = binomial(),
method = "REML"
)
rf_likelihood <- function(model, newdata) {
predict(model, newdata, type = "prob")[, "1"]
}
gam_likelihood <- function(model, newdata) {
as.numeric(stats::predict(model, newdata, type = "response"))
}
set.seed(20260505)
curves::multimodel(
list(rf_member, gam_member),
ensemble_training[, ensemble_covars],
fun = list(rf_likelihood, gam_likelihood),
method = "pdp",
n = 60,
background_n = 200,
show_models = TRUE,
ylab = "Relative likelihood"
)
For a 3D surface, a small wrapper around predict() is
still convenient when you want direct control over the returned
prediction vector.
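A minimal sketch of that idea is shown below. rf_surface() is a hypothetical helper, not part of curves: it evaluates the forest's presence probability over a grid of two covariates, holds the remaining covariates fixed at the values of the first row of the data, and passes the resulting matrix to persp() from base graphics.
# Hand-rolled 3D surface: grid the two chosen covariates, keep all other
# columns fixed at the first row of `dat`, and plot the predicted
# presence probabilities with persp().
rf_surface <- function(model, dat, x = "mi", y = "rainann", n = 40) {
  grid_x <- seq(min(dat[[x]]), max(dat[[x]]), length.out = n)
  grid_y <- seq(min(dat[[y]]), max(dat[[y]]), length.out = n)
  base <- dat[rep(1, n * n), ]
  base[[x]] <- rep(grid_x, times = n)
  base[[y]] <- rep(grid_y, each = n)
  z <- matrix(predict(model, base, type = "prob")[, "1"], nrow = n)
  persp(grid_x, grid_y, z, xlab = x, ylab = y,
    zlab = "Relative likelihood", theta = 35, phi = 25)
}
rf_surface(rf_model, predictors)
Because the wrapper returns plain prediction values, the same pattern also works for models whose predict() methods use different argument names or prediction scales.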