--- title: "Parameter calibration and cost functions" author: "Josefa ArĂ¡n" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Parameter calibration and cost functions} %\VignetteEngine{knitr::rmarkdown} %\usepackage[utf8]{inputenc} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) library(rsofun) library(dplyr) library(ggplot2) ``` The `rsofun` package allows to calibrate parameters of the `pmodel` and `biomee` models via the `calib_sofun()` function. The implementation of the calibration is fairly flexible and can be adapted to a specific use-case via a cost function (used as metrics for the optimization routines in `calib_sofun()`). The package provides a set of standard cost functions named `cost_*`, which can be used for a variety of calibrations (different sets of model parameters, using various target variables, etc.). Alternatively, it's possible to write a more specific new cost function to be used together with `calib_sofun()`. In this vignette, we go over some examples on how to use the `rsofun` cost functions for parameter calibration and how to write your own custom one from scratch. ### Calibration to GPP using RMSE and GenSA optimizer A simple approach to parameter calibration is to find the parameter values that lead to the best prediction performance, in terms of the RMSE (root mean squared error). The function `cost_rmse_pmodel()` runs the P-model internally to calculate the RMSE between predicted target values (in this case GPP) and the corresponding observations. The implementation of `cost_rmse_pmodel()` allows flexibility in various ways. We can simultaneously calibrate a subset of model parameters and also replicate the different calibration setups in Stocker et al., 2020 GMD. For example, following the `ORG` setup, only parameter `kphio` is calibrated. Furthermore, the standard cost functions allow to calibrate to several targets (fluxes and leaf traits predicted by the P-model) simultaneously and to parallelize the simulations. Since the P-model is run internally to make predictions, we must always specify which values the model parameters should take, i.e. the parameters that aren't calibrated (via argument `par_fixed`). The syntax to run the calibration routine is as follows: ```{r eval = FALSE} # Define calibration settings and parameter ranges from previous work settings_rmse <- list( method = 'GenSA', # minimizes the RMSE metric = cost_rmse_pmodel, # our cost function control = list( # control parameters for optimizer GenSA maxit = 100), par = list( # bounds for the parameter space kphio = list(lower=0.02, upper=0.2, init=0.05) ) ) # Calibrate the model and optimize the free parameters using # demo datasets pars_calib_rmse <- calib_sofun( # calib_sofun arguments: drivers = p_model_drivers, obs = p_model_validation, settings = settings_rmse, # extra arguments passed to the cost function: par_fixed = list( # fix all other parameters kphio_par_a = 0.0, # set to zero to disable temperature-dependence # of kphio, setup ORG kphio_par_b = 1.0, soilm_thetastar = 0.6 * 240, # to recover paper setup with soil moisture stress soilm_betao = 0.0, beta_unitcostratio = 146.0, rd_to_vcmax = 0.014, # value from Atkin et al. 2015 for C3 herbaceous tau_acclim = 30.0, kc_jmax = 0.41 ), targets = "gpp" # define target variable GPP ) ``` The output of `calib_sofun()` is a list containing the calibrated parameter values and the raw optimization output from the optimizer (here from `GenSA` or, as we see next, from `BayesianTools::runMCMC`). ### Calibration to GPP using a simple likelihood function and BayesianTools Let's calibrate the parameters involved in the temperature dependency of the quantum yield efficiency, `kphio`, `kphio_par_a` and `kphio_par_b`, taking a Bayesian calibration approach. We assume that the target variable (`'gpp'`) follows a normal distribution centered at the observations and with its standard deviation being a new calibratable parameter (`'err_gpp'`). We also assume a uniform prior distribution for all calibratable parameters. By maximizing the normal log-likelihood, the MAP (maximum a posteriori) estimators for all 4 parameters are computed. With the function `cost_likelihood_pmodel()`, we can easily perform this calibration, as follows: ```{r eval = FALSE} # Define calibration settings settings_likelihood <- list( method = 'BayesianTools', metric = cost_likelihood_pmodel, # our cost function control = list( # optimization control settings for sampler = 'DEzs', # BayesianTools::runMCMC settings = list( burnin = 1500, iterations = 3000 )), par = list( kphio = list(lower = 0, upper = 0.2, init = 0.05), kphio_par_a = list(lower = -0.5, upper = 0.5, init = -0.1), kphio_par_b = list(lower = 10, upper = 40, init =25), err_gpp = list(lower = 0.1, upper = 4, init = 0.8) ) ) # Calibrate the model and optimize the free parameters using # demo datasets pars_calib_likelihood <- calib_sofun( # calib_sofun arguments: drivers = p_model_drivers, obs = p_model_validation, settings = settings_likelihood, # extra arguments passed ot the cost function: par_fixed = list( # fix all other parameters soilm_thetastar = 0.6 * 240, # to recover paper setup with soil moisture stress soilm_betao = 0.0, beta_unitcostratio = 146.0, rd_to_vcmax = 0.014, # value from Atkin et al. 2015 for C3 herbaceous tau_acclim = 30.0, kc_jmax = 0.41 ), targets = "gpp" ) ``` Furthermore, there are equivalent cost functions available for the BiomeE model. Check out the reference pages for more details on how to use `cost_likelihood_biomee()` and `cost_rmse_biomee()`. ### Calibration to GPP and Vcmax25 using the joint log-likelihood and BayesianTools You may be interested in calibrating the model to different target variables simultaneously, like flux and leaf trait measurements. Here we present an example, where we use `cost_likelihood_pmodel()` to compute the joint normal likelihood of all the targets specified (that is, by summing the log-likelihoods of GPP and Vcmax25) and ultimately calibrate the `kc_jmax` parameter. It would be possible to follow this workflow for several-target calibration also with RMSE as the optimization metric, using `cost_rmse_pmodel()` and `GenSA` optimization. ```{r eval = FALSE} # Define calibration settings for two targets settings_joint_likelihood <- list( method = "BayesianTools", metric = cost_likelihood_pmodel, control = list( sampler = "DEzs", settings = list( burnin = 1500, # kept artificially low iterations = 3000 )), par = list(kc_jmax = list(lower = 0.2, upper = 0.8, init = 0.41), # uniform priors err_gpp = list(lower = 0.001, upper = 4, init = 1), err_vcmax25 = list(lower = 0.000001, upper = 0.0001, init = 0.00001)) ) # Run the calibration on the concatenated data par_calib_join <- calib_sofun( drivers = rbind(p_model_drivers, p_model_drivers_vcmax25), obs = rbind(p_model_validation, p_model_validation_vcmax25), settings = settings_joint_likelihood, # arguments for the cost function par_fixed = list( # fix parameter value from previous calibration kphio = 0.041, kphio_par_a = 0.0, kphio_par_b = 16, soilm_thetastar = 0.6 * 240, # to recover paper setup with soil moisture stress soilm_betao = 0.0, beta_unitcostratio = 146.0, rd_to_vcmax = 0.014, # value from Atkin et al. 2015 for C3 herbaceous tau_acclim = 30.0 ), targets = c('gpp', 'vcmax25') ) ``` Note that GPP predictions are directly compared to GPP observations on that day, but Vcmax25 predicted by the P-model (being a leaf trait) is averaged over the growing season and compared to a single Vcmax25 observation taken per site. The cost functions provided in the package tell apart fluxes and leaf traits by the presence of a `"date"` column in the nested validation data frames `p_model_validation` and `p_model_validation_vcmax25`. ### Write your custom cost function If the RMSE or normal log-likelihood (for one or several targets) cost functions that we provide do not fit your use case, you can easily write a custom one. In this section, we drive you through the main ideas with an example. All cost functions must take at least three arguments: * `par`: A vector of calibratable model parameters. In each iteration of the optimization, a new set of values of `par` is used to run the model and compute the cost. * `obs`: A data frame of observations, against which to compare the simulation results. * `drivers`: A data frame of driver data, used to run the simulations. * Additional optional arguments can be used, like for example the model parameter values that should be fixed across simulations, etc. Since we are calibrating the parameters based on model outputs, the cost function runs the P-model and compare its output to observed validation data. ```{r, eval = FALSE} function(par, obs, drivers){ # Your code } ``` In the optimization procedure, the cost function only takes as argument the parameters `par` that are fed to `calib_sofun()` via `settings$par` (see previous sections). Nevertheless, within the cost function we call `runread_pmodel_f()` and this function needs a full set of model parameters. Therefore, the parameters that aren't being calibrated must be hard coded inside the cost function (or passed as an argument like the `rsofun` cost functions). In this example, we only want to calibrate the soil moisture stress parameters. ```{r, eval = FALSE} function(par, obs, drivers){ # Set values for the list of calibrated and non-calibrated model parameters params_modl <- list( kphio = 0.09423773, kphio_par_a = 0.0, kphio_par_b = 25, soilm_thetastar = par[1], soilm_betao = par[2], beta_unitcostratio = 146.0, rd_to_vcmax = 0.014, tau_acclim = 30.0, kc_jmax = 0.41 ) # Run the model df <- runread_pmodel_f( drivers, par = params_modl, makecheck = TRUE, parallel = FALSE ) # Your code to compute the cost } ``` The following chunk defines the final function. We clean the observations and model output and align the data according to site and date, to compute the mean absolute error (MAE) on GPP. Finally, the function should return a scalar value, in this case the MAE, which we want to minimize. Keep in mind that the GenSA optimization will minimize the cost, but with the BayesianTools method the cost is always maximized. ```{r} cost_mae <- function(par, obs, drivers){ # Set values for the list of calibrated and non-calibrated model parameters params_modl <- list( kphio = 0.09423773, kphio_par_a = 0.0, kphio_par_b = 25, soilm_thetastar = par[1], soilm_betao = par[2], beta_unitcostratio = 146.0, rd_to_vcmax = 0.014, tau_acclim = 30.0, kc_jmax = 0.41 ) # Set values for the list of calibrated and non-calibrated model parameters # Run the model df <- runread_pmodel_f( drivers = drivers, par = params_modl, makecheck = TRUE, parallel = FALSE ) # Clean model output to compute cost df <- df %>% dplyr::select(sitename, data) %>% tidyr::unnest(data) # Clean validation data to compute cost obs <- obs %>% dplyr::select(sitename, data) %>% tidyr::unnest(data) %>% dplyr::rename('gpp_obs' = 'gpp') # rename for later # Left join model output with observations by site and date df <- dplyr::left_join(df, obs, by = c('sitename', 'date')) # Compute mean absolute error cost <- mean(abs(df$gpp - df$gpp_obs), na.rm = TRUE) # Return the computed cost return(cost) } ``` As a last step, let's verify that the calibration procedure runs using this cost function. ```{r eval = FALSE} # Define calibration settings and parameter ranges settings_mae <- list( method = 'GenSA', metric = cost_mae, # our cost function control = list( maxit = 100), par = list( soilm_thetastar = list(lower=0.0, upper=3000, init=0.6*240), soilm_betao = list(lower=0, upper=1, init=0.2) ) ) # Calibrate the model and optimize the free parameters pars_calib_mae <- calib_sofun( drivers = p_model_drivers, obs = p_model_validation, settings = settings_mae ) ```