ldmppr
Workflow on
Simulated DataTo illustrate the use of the ldmppr
package, we will be
using the included small_example_data
dataset. This dataset
consists of locations and sizes for 121 points in a (25m x 25m) square
domain.
# Load the data
data("small_example_data")
# Display the first few rows of the dataset
nrow(small_example_data)
#> [1] 121
head(small_example_data)
#> x y size
#> 1 10.000000 14.000000 897.4857
#> 2 9.437449 10.841183 623.7816
#> 3 16.434584 10.734105 583.2180
#> 4 11.155480 8.432171 508.4971
#> 5 13.293043 22.866626 472.0684
#> 6 4.001424 13.901061 434.1340
We also include a collection of example raster files obtained from the following ESS-DIVE repository: https://data.ess-dive.lbl.gov/view/doi:10.15485/1652536.
Details related to generating and obtaining these datasets can be
found in the help file for the small_example_data
dataset.
The ldmppr
package provides tools for estimating marked
point processes with regularity and dependence between the marks and the
locations using self-correcting point processes, training mark models,
assessing model fit, and simulating spatio-temporal datasets. This
vignette demonstrates the core functionality of the package through an
example workflow. The approach we take in fitting a location dependent
marked point process is to equate it to a spatio-temporal process where
the marks (i.e. sizes) can be mapped to arrival times of events. This
allows us to use the machinery of spatio-temporal point processes to
estimate the parameters of the marked point process using a likelihood
based approach that is computationally feasible.
We start by estimating the parameters of a self-correcting
spatio-temporal point process using the
estimate_parameters_sc
function. This function makes use of
a user specified optimization algorithm in the nloptr
function to perform estimation through log-likelihood optimization. The
function requires the observed locations and observed arrival times,
which are generated from the observed sizes using a power law mapping
function. The power law mapping function depends on the hyperparameter
delta
, which controls the shape of the mapping relationship
between sizes and arrival times. The function also requires the grid
values for the optimization process, the initial parameter values, the
upper bounds for the parameters, and the optimization algorithm to use.
The function returns the optimal parameter estimates for the
self-correcting point process.
# Map the observed sizes to arrival times using the power_law_mapping function
parameter_estimation_data <- small_example_data %>%
dplyr::mutate(time = power_law_mapping(size = size, delta = .5)) %>%
dplyr::select(time, x, y)
# Define the grid values
x_grid <- seq(0, 25, length.out = 5)
y_grid <- seq(0, 25, length.out = 5)
t_grid <- seq(0, 1, length.out = 5)
# Define the parameter initialization values
parameter_inits <- c(1.5, 8.5, .015, 1.5, 3.2, .75, 3, .08)
# Define the upper bounds
upper_bounds <- c(1, 25, 25)
# Estimate the parameters
estimated_sc <- estimate_parameters_sc(
data = parameter_estimation_data,
x_grid = x_grid,
y_grid = y_grid,
t_grid = t_grid,
parameter_inits = parameter_inits,
upper_bounds = upper_bounds,
opt_algorithm = "NLOPT_LN_SBPLX",
nloptr_options = list(
maxeval = 300,
xtol_rel = 1e-3
),
verbose = FALSE
)
# Obtain the optimal parameter estimates
optimal_parameters <- estimated_sc$solution
print(optimal_parameters)
#> [1] 1.73298314 8.76107237 0.02549216 1.51659912 4.57635530 1.18075615 2.77500000
#> [8] 0.16552518
In practice, we recommend using a denser grid for the optimization
process to increase the likelihood that the global optimum is found. We
also note that the function
estimate_parameters_sc_parallel()
can be used to fit the
model for varying values of delta
in parallel to identify
the optimal mapping between sizes and arrival times.
Next, we use the train_mark_model
function to train a
suitably flexible mark model using location specific covariates and the
generated arrival times to predict sizes. The function uses the
xgboost
or ranger
engines to train the model
and may be run in parallel if desired. The user has control over the
choice of a Gradient Boosting Machine (GBM) or Random Forest (RF) model,
the bounds of the spatial domain, the inclusion of competition indices,
the competition radius, the correction method, the final model selection
metric, the number of cross validation folds, and the size of the
parameter tuning grid for the model.
# Load the example raster files
raster_paths <- list.files(system.file("extdata", package = "ldmppr"), pattern = "\\.tif$", full.names = TRUE)
rasters <- lapply(raster_paths, terra::rast)
scaled_rasters <- scale_rasters(rasters)
# Generate the time values using the power law mapping with optimal delta
model_training_data <- small_example_data %>%
dplyr::mutate(time = power_law_mapping(size, .5))
# Train the model
example_mark_model <- train_mark_model(
data = model_training_data,
raster_list = scaled_rasters,
model_type = "xgboost",
xy_bounds = c(0, 25, 0, 25),
parallel = FALSE,
include_comp_inds = TRUE,
competition_radius = 10,
correction = "none",
selection_metric = "rmse",
cv_folds = 5,
tuning_grid_size = 20,
verbose = TRUE
)
#> Processing data...
#> Training XGBoost model...
#> Training complete!
# Unbundle the model
example_mark_model <- bundle::unbundle(example_mark_model$bundled_model)
The train_mark_model
function returns a trained mark
model object that can be used to predict sizes for new locations. The
model object contains the trained model, the optimal hyperparameters,
the cross-validated performance metrics, and the feature importance
values. In practice, we recommend taking advantage of the
parallelization capabilities of the function to speed up the training
process, in addition to using more cross-validation folds and a denser
tuning grid to obtain a more robust model.
Now that we have an estimated self-correcting point process and a
trained mark model, we can check the fit of the model using the
check_model_fit
function. This function provides a variety
of non-parametric summaries for point processes and global envelope
tests to assess the goodness of fit of the model. The package provides
individual envelope tests for the L, F, G, J, E, and V statistics, or a
combined envelope test for all statistics simultaneously by making use
of the functionality of the spatstat
and GET
packages. By plotting the combined envelope test, we can visually assess
the goodness of fit of the model and obtain a p-value measuring how well
the estimated model captures the relationships observed in the reference
data. Typically a p-value less than 0.05 indicates a poor fit of the
model to the data, and the authors of the GET
package
recommend a minimum of 2500 simulations to obtain a valid p-value at the
.05 level.
# Generate the reference pattern
reference_data <- generate_mpp(
locations = small_example_data[, c("x", "y")],
marks = small_example_data$size,
xy_bounds = c(0, 25, 0, 25)
)
#> Registered S3 method overwritten by 'spatstat.geom':
#> method from
#> print.metric yardstick
# Define an anchor point
M_n <- as.matrix(small_example_data[1, c("x", "y")])
# Specify the estimated parameters of the self-correcting process
estimated_parameters <- optimal_parameters
# Check the model fit
example_model_fit <- check_model_fit(
reference_data = reference_data,
t_min = 0,
t_max = 1,
sc_params = estimated_parameters,
anchor_point = M_n,
raster_list = scaled_rasters,
mark_model = example_mark_model,
xy_bounds = c(0, 25, 0, 25),
include_comp_inds = TRUE,
thinning = TRUE,
correction = "none",
competition_radius = 10,
n_sim = 100,
save_sims = FALSE,
verbose = TRUE,
seed = 90210
)
#> Beginning Data Simulations...
#> Data Simulations Complete...
# Plot the combined global envelope test results
plot(example_model_fit$combined_env)
The combined envelope test provides a visual summary of the goodness of fit of the model to the reference data. The p-value of the test can be used to assess how well the fitted model reflects the reference data. If the p-value is less than 0.05, the model may not be a good fit for the data.
Finally, we can simulate datasets from the fitted model using the
simulate_mpp
function. This function generates a
realization of a marked point process given the estimated parameters of
the self-correcting process and a trained mark model. The function
allows for the specification of the number of points to simulate, the
time range to simulate over, the spatial domain to simulate over, and
the inclusion of competition indices. The function returns a list
containing the marked point process object and a data frame containing
the simulated locations, marks, and arrival times. The realization can
be visualized using the plot_mpp
function.
# Simulate a marked point process
simulated_mpp <- simulate_mpp(
sc_params = estimated_parameters,
t_min = 0,
t_max = 1,
anchor_point = M_n,
raster_list = scaled_rasters,
mark_model = example_mark_model,
xy_bounds = c(0, 25, 0, 25),
include_comp_inds = TRUE,
competition_radius = 10,
correction = "none",
thinning = TRUE
)
# Plot the simulated marked point process
plot_mpp(
mpp_data = simulated_mpp$mpp,
pattern_type = "simulated"
)
The plot_mpp
function provides a visual representation
of the simulated marked point process.