Custom aggregation

Victor Granda (Sapfluxnet Team)

2023-01-25

The sapfluxnetr package offers a flexible yet powerful API, based on the tidyverse packages, to aggregate and summarise the sites' data: the sfn_metrics function. All the metrics family of functions (?metrics) use sfn_metrics under the hood. If you want full control over the statistics returned and the aggregation periods, we recommend using this API. This vignette shows you how.

Pre-fixed summarising functions

  1. daily_metrics
  2. monthly_metrics
  3. predawn_metrics
  4. midday_metrics
  5. nightly_metrics
  6. daylight_metrics

See each function's help page for a detailed description and usage examples.
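As a quick orientation before building custom summaries, the pre-fixed functions can be called directly on a site object. A minimal sketch, using the ARG_TRE example dataset shipped with the package (the default metrics set is returned, split into sap flow and environmental summaries):

```r
library(sapfluxnetr)

# example site data shipped with the package
data('ARG_TRE', package = 'sapfluxnetr')

# full default metrics set at a daily scale
foo_daily <- daily_metrics(ARG_TRE, solar = TRUE)

# a list with the sap flow and environmental summaries
names(foo_daily)
```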

Custom summarising functions

daily_metrics and related functions return a complete set of metrics ready to use, but if you need different metrics you can supply your own summarising functions through the .funs argument.
The correct way of specifying the functions is described in the summarise_all help (?dplyr::summarise_all). The recommended way is a named list of formulas containing the function calls:

# libraries
library(sapfluxnetr)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

### only mean and sd at a daily scale
# data
data('ARG_TRE', package = 'sapfluxnetr')

# summarising funs (as a list of formulas)
custom_funs <- list(mean = ~ mean(., na.rm = TRUE), std_dev = ~ sd(., na.rm = TRUE))

# metrics
foo_simpler_metrics <- sfn_metrics(
  ARG_TRE,
  period = '1 day',
  .funs = custom_funs,
  solar = TRUE,
  interval = 'general'
)
#> [1] "Crunching data for ARG_TRE. In large datasets this could take a while"
#> [1] "General data for ARG_TRE"

foo_simpler_metrics[['sapf']]
#> # A tibble: 14 × 9
#>    TIMESTAMP           ARG_TRE…¹ ARG_T…² ARG_T…³ ARG_T…⁴ ARG_T…⁵ ARG_T…⁶ ARG_T…⁷
#>    <dttm>                  <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
#>  1 2009-11-17 00:00:00      308.    173.    303.    255.    20.7    23.2    14.0
#>  2 2009-11-18 00:00:00      507.    376.    432.    490.   170.    174.    130. 
#>  3 2009-11-19 00:00:00      541.    380.    391.    524.   262.    169.    150. 
#>  4 2009-11-20 00:00:00      330.    218.    272.    334.   139.     67.2    74.6
#>  5 2009-11-21 00:00:00      338.    219.    278.    351.   190.    108.    113. 
#>  6 2009-11-22 00:00:00      384.    243.    310.    383.   268.    172.    184. 
#>  7 2009-11-23 00:00:00      492.    300.    390.    513.   327.    200.    228. 
#>  8 2009-11-24 00:00:00      573.    389.    497.    626.   313.    222.    261. 
#>  9 2009-11-25 00:00:00      601.    400.    484.    644.   193.    133.    170. 
#> 10 2009-11-26 00:00:00      502.    360.    450.    613.   277.    233.    308. 
#> 11 2009-11-27 00:00:00      544.    411.    506.    740.   271.    221.    285. 
#> 12 2009-11-28 00:00:00      573.    451.    589.    840.   180.    169.    249. 
#> 13 2009-11-29 00:00:00      371.    285.    357.    547.   233.    220.    197. 
#> 14 2009-11-30 00:00:00      386.    293.    381.    602.   273.    209.    288. 
#> # … with 1 more variable: ARG_TRE_Nan_Jt_4_std_dev <dbl>, and abbreviated
#> #   variable names ¹​ARG_TRE_Nan_Jt_1_mean, ²​ARG_TRE_Nan_Jt_2_mean,
#> #   ³​ARG_TRE_Nan_Jt_3_mean, ⁴​ARG_TRE_Nan_Jt_4_mean, ⁵​ARG_TRE_Nan_Jt_1_std_dev,
#> #   ⁶​ARG_TRE_Nan_Jt_2_std_dev, ⁷​ARG_TRE_Nan_Jt_3_std_dev

When supplying only one function to .funs, variable names are left unchanged (no metric name is appended), as the summarised data has the same columns as the original data.
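This behaviour can be checked by supplying a single, unnamed summarising function (a minimal sketch; compare the column names with foo_simpler_metrics above, where the "_mean" and "_std_dev" suffixes were appended):

```r
# a single summarising function
single_fun <- list(~ mean(., na.rm = TRUE))

foo_single <- sfn_metrics(
  ARG_TRE,
  period = '1 day',
  .funs = single_fun,
  solar = TRUE,
  interval = 'general'
)

# column names keep the original tree codes, with no metric suffix
names(foo_single[['sapf']])
```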

Special interest intervals

You can also choose whether the “special interest” intervals (predawn, midday, nighttime or daylight) are calculated. For example, if you are only interested in the midday interval you can use:

foo_simpler_metrics_midday <- sfn_metrics(
  ARG_TRE,
  period = '1 day',
  .funs = custom_funs,
  solar = TRUE,
  interval = 'midday', int_start = 11, int_end = 13
)
#> [1] "Crunching data for ARG_TRE. In large datasets this could take a while"
#> [1] "midday data for ARG_TRE"

foo_simpler_metrics_midday[['sapf']]
#> # A tibble: 13 × 9
#>    TIMESTAMP_md        ARG_TRE…¹ ARG_T…² ARG_T…³ ARG_T…⁴ ARG_T…⁵ ARG_T…⁶ ARG_T…⁷
#>    <dttm>                  <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
#>  1 2009-11-18 00:00:00      685.    665.    614.    719.   23.8    70.1    40.1 
#>  2 2009-11-19 00:00:00      879.    594.    626.    664.  193.     67.3     9.82
#>  3 2009-11-20 00:00:00      438.    272.    258.    474.  116.     54.2    63.1 
#>  4 2009-11-21 00:00:00      631.    379.    533.    619.   46.8    25.1     7.69
#>  5 2009-11-22 00:00:00      783.    535.    680.    875.   40.9    42.4   194.  
#>  6 2009-11-23 00:00:00      841.    478.    618.   1018.   13.3     9.47    9.94
#>  7 2009-11-24 00:00:00      951.    636.    789.    829.   27.0     1.00  168.  
#>  8 2009-11-25 00:00:00      907.    602.    789.    913.   22.1    31.0   229.  
#>  9 2009-11-26 00:00:00      861.    697.    925.   1265.  100.     97.8   229.  
#> 10 2009-11-27 00:00:00      806.    594.    706.   1044.    1.67   42.8     4.14
#> 11 2009-11-28 00:00:00      837.    730.    925.   1313.   11.2    30.3   228.  
#> 12 2009-11-29 00:00:00      638.    605.    666.   1333.   40.7    29.5    49.5 
#> 13 2009-11-30 00:00:00      548.    371.    444.    961.   44.6   149.    222.  
#> # … with 1 more variable: ARG_TRE_Nan_Jt_4_std_dev_md <dbl>, and abbreviated
#> #   variable names ¹​ARG_TRE_Nan_Jt_1_mean_md, ²​ARG_TRE_Nan_Jt_2_mean_md,
#> #   ³​ARG_TRE_Nan_Jt_3_mean_md, ⁴​ARG_TRE_Nan_Jt_4_mean_md,
#> #   ⁵​ARG_TRE_Nan_Jt_1_std_dev_md, ⁶​ARG_TRE_Nan_Jt_2_std_dev_md,
#> #   ⁷​ARG_TRE_Nan_Jt_3_std_dev_md

Custom aggregation periods

The period argument of sfn_metrics is passed to the internal .collapse_timestamp function, so it accepts the same inputs:

# weekly
foo_weekly <- sfn_metrics(
  ARG_TRE,
  period = '7 days',
  .funs = custom_funs,
  solar = TRUE,
  interval = 'general'
)
#> [1] "Crunching data for ARG_TRE. In large datasets this could take a while"
#> [1] "General data for ARG_TRE"

foo_weekly[['env']]
#> # A tibble: 3 × 19
#>   TIMESTAMP           ta_mean rh_mean vpd_mean sw_in_m…¹ ws_mean preci…² swc_s…³
#>   <dttm>                <dbl>   <dbl>    <dbl>     <dbl>   <dbl>   <dbl>   <dbl>
#> 1 2009-11-15 00:00:00    4.81    35.3    0.598      280.    15.5 0.00612   0.365
#> 2 2009-11-22 00:00:00    6.15    35.3    0.656      327.    24.5 0.192     0.348
#> 3 2009-11-29 00:00:00    2.55    40.9    0.453      261.    23.1 0.122     0.374
#> # … with 11 more variables: ppfd_in_mean <dbl>, ext_rad_mean <dbl>,
#> #   ta_std_dev <dbl>, rh_std_dev <dbl>, vpd_std_dev <dbl>, sw_in_std_dev <dbl>,
#> #   ws_std_dev <dbl>, precip_std_dev <dbl>, swc_shallow_std_dev <dbl>,
#> #   ppfd_in_std_dev <dbl>, ext_rad_std_dev <dbl>, and abbreviated variable
#> #   names ¹​sw_in_mean, ²​precip_mean, ³​swc_shallow_mean
# quarterly, using a function as period
data('AUS_CAN_ST2_MIX', package = 'sapfluxnetr')

foo_custom <- sfn_metrics(
  AUS_CAN_ST2_MIX,
  period = lubridate::quarter,
  .funs = custom_funs,
  solar = TRUE,
  interval = 'general',
  with_year = TRUE # argument for lubridate::quarter
)
#> [1] "Crunching data for AUS_CAN_ST2_MIX. In large datasets this could take a while"
#> Warning in .period_to_minutes(period, .data$TIMESTAMP, unique(.data$timestep)): when using a custom function as period, coverage calculation
#>             can be less accurate

#> Warning in .period_to_minutes(period, .data$TIMESTAMP, unique(.data$timestep)): when using a custom function as period, coverage calculation
#>             can be less accurate
#> [1] "General data for AUS_CAN_ST2_MIX"
foo_custom['env']
#> $env
#> # A tibble: 5 × 17
#>   TIMESTAMP ta_mean vpd_mean sw_in_mean ws_mean precip…¹ ppfd_…² rh_mean ext_r…³
#>       <dbl>   <dbl>    <dbl>      <dbl>   <dbl>    <dbl>   <dbl>   <dbl>   <dbl>
#> 1     2006.    8.48    0.106       25.4   0.161   0.0158    53.7    93.1    166.
#> 2     2006.   10.5     0.278       82.0   0.318   0.0399   173.     86.9    245.
#> 3     2006.   16.6     0.826      219.    0.581   0.0200   463.     72.8    468.
#> 4     2007.   20.9     0.985      197.    0.439   0.0333   416.     75.7    435.
#> 5     2007.   15.8     0.386      110.    0.200   0.0181   231.     86.7    213.
#> # … with 8 more variables: ta_std_dev <dbl>, vpd_std_dev <dbl>,
#> #   sw_in_std_dev <dbl>, ws_std_dev <dbl>, precip_std_dev <dbl>,
#> #   ppfd_in_std_dev <dbl>, rh_std_dev <dbl>, ext_rad_std_dev <dbl>, and
#> #   abbreviated variable names ¹​precip_mean, ²​ppfd_in_mean, ³​ext_rad_mean

Extra parameters

sfn_metrics has a ... parameter intended to supply additional arguments to the internal functions used:

  1. .collapse_timestamp accepts the following extra argument:

    • side
      “start” by default in the sfn_metrics implementation
  2. dplyr::summarise_all accepts extra arguments that are applied to the summarising functions provided (to all of them, so they all must accept that argument or an error will be raised). That is why we recommend the list-of-formulas notation, as there the arguments are specified for each individual function.

For example, if we want the TIMESTAMPs after aggregation to show the end of the period instead of the beginning (the default), we can do the following:

foo_simpler_metrics_end <- sfn_metrics(
  ARG_TRE,
  period = '1 day',
  .funs = custom_funs,
  solar = TRUE,
  interval = 'general',
  side = "end"
)
#> [1] "Crunching data for ARG_TRE. In large datasets this could take a while"
#> [1] "General data for ARG_TRE"

foo_simpler_metrics_end[['sapf']]
#> # A tibble: 14 × 9
#>    TIMESTAMP           ARG_TRE…¹ ARG_T…² ARG_T…³ ARG_T…⁴ ARG_T…⁵ ARG_T…⁶ ARG_T…⁷
#>    <dttm>                  <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
#>  1 2009-11-18 00:00:00      308.    173.    303.    255.    20.7    23.2    14.0
#>  2 2009-11-19 00:00:00      507.    376.    432.    490.   170.    174.    130. 
#>  3 2009-11-20 00:00:00      541.    380.    391.    524.   262.    169.    150. 
#>  4 2009-11-21 00:00:00      330.    218.    272.    334.   139.     67.2    74.6
#>  5 2009-11-22 00:00:00      338.    219.    278.    351.   190.    108.    113. 
#>  6 2009-11-23 00:00:00      384.    243.    310.    383.   268.    172.    184. 
#>  7 2009-11-24 00:00:00      492.    300.    390.    513.   327.    200.    228. 
#>  8 2009-11-25 00:00:00      573.    389.    497.    626.   313.    222.    261. 
#>  9 2009-11-26 00:00:00      601.    400.    484.    644.   193.    133.    170. 
#> 10 2009-11-27 00:00:00      502.    360.    450.    613.   277.    233.    308. 
#> 11 2009-11-28 00:00:00      544.    411.    506.    740.   271.    221.    285. 
#> 12 2009-11-29 00:00:00      573.    451.    589.    840.   180.    169.    249. 
#> 13 2009-11-30 00:00:00      371.    285.    357.    547.   233.    220.    197. 
#> 14 2009-12-01 00:00:00      386.    293.    381.    602.   273.    209.    288. 
#> # … with 1 more variable: ARG_TRE_Nan_Jt_4_std_dev <dbl>, and abbreviated
#> #   variable names ¹​ARG_TRE_Nan_Jt_1_mean, ²​ARG_TRE_Nan_Jt_2_mean,
#> #   ³​ARG_TRE_Nan_Jt_3_mean, ⁴​ARG_TRE_Nan_Jt_4_mean, ⁵​ARG_TRE_Nan_Jt_1_std_dev,
#> #   ⁶​ARG_TRE_Nan_Jt_2_std_dev, ⁷​ARG_TRE_Nan_Jt_3_std_dev

Compared with the foo_simpler_metrics calculated before, the period is now identified in the TIMESTAMP by the end of the period (daily in this case).

When supplying a custom function as the period argument, the default coverage statistic is not reliable, as there is no way of knowing the period length in minutes beforehand.

Temporary columns helpers

The internal aggregation process in sfn_metrics generates some transitory columns that can be used in the summarising functions:

TIMESTAMP_coll

When aggregating by the declared period (e.g. "1 day"), the TIMESTAMP column collapses to the period start/end value (meaning that all the TIMESTAMP values for the same day become identical).
This makes it impossible to use summarising functions that obtain the time of day at which an event happens (e.g. the time of day at which the maximum sap flow occurs), because all TIMESTAMP values are identical. For that kind of summarising function, a transitory column called TIMESTAMP_coll is created, retaining the uncollapsed TIMESTAMP values. So in this case we can create a function that takes the variable values for the day and the TIMESTAMP_coll values for the day, returns the TIMESTAMP at which the maximum sap flow occurs, and use it with sfn_metrics:

max_time <- function(x, time) {

  # x: vector of values for a day
  # time: TIMESTAMP values for the day

  # If all the values in x are NA (e.g. the daily summary of a day with no
  # measures), time[which.max(x)] returns a length-0 POSIXct vector, which
  # would crash the dplyr summarise step. So, check for all-NA first and
  # return NA as POSIXct in that case.
  if (all(is.na(x))) {
    as.POSIXct(NA, tz = attr(time, 'tz'), origin = lubridate::origin)
  } else {
    time[which.max(x)]
  }
}

custom_funs <- list(max = ~ max(., na.rm = TRUE), max_time = ~ max_time(., TIMESTAMP_coll))

max_time_metrics <- sfn_metrics(
  ARG_TRE,
  period = '1 day',
  .funs = custom_funs,
  solar = TRUE,
  interval = 'general'
)
#> [1] "Crunching data for ARG_TRE. In large datasets this could take a while"
#> [1] "General data for ARG_TRE"

max_time_metrics[['sapf']]
#> # A tibble: 14 × 9
#>    TIMESTAMP           ARG_TRE_Nan…¹ ARG_T…² ARG_T…³ ARG_T…⁴ ARG_TRE_Nan_Jt_1_…⁵
#>    <dttm>                      <dbl>   <dbl>   <dbl>   <dbl> <dttm>             
#>  1 2009-11-17 00:00:00          322.    190.    313.    293. 2009-11-17 22:24:58
#>  2 2009-11-18 00:00:00          778.    715.    679.    948. 2009-11-18 13:24:43
#>  3 2009-11-19 00:00:00         1015.    694.    633.    978. 2009-11-19 12:24:26
#>  4 2009-11-20 00:00:00          648.    401.    442.    636. 2009-11-20 13:24:10
#>  5 2009-11-21 00:00:00          664.    406.    539.    633. 2009-11-21 12:23:52
#>  6 2009-11-22 00:00:00          812.    564.    816.    877. 2009-11-22 12:23:34
#>  7 2009-11-23 00:00:00         1085.    676.    935.   1150. 2009-11-23 13:23:15
#>  8 2009-11-24 00:00:00          992.    736.   1115.   1547. 2009-11-24 17:22:56
#>  9 2009-11-25 00:00:00          976.    646.    951.   1027. 2009-11-25 10:22:36
#> 10 2009-11-26 00:00:00          932.    766.   1087.   1384. 2009-11-26 12:22:15
#> 11 2009-11-27 00:00:00          862.    704.    921.   1193. 2009-11-27 16:21:54
#> 12 2009-11-28 00:00:00          845.    763.   1165.   1706. 2009-11-28 11:21:33
#> 13 2009-11-29 00:00:00          714.    747.    701.   1633. 2009-11-29 13:21:11
#> 14 2009-11-30 00:00:00          875.    646.    919.   1853. 2009-11-30 15:20:48
#> # … with 3 more variables: ARG_TRE_Nan_Jt_2_max_time <dttm>,
#> #   ARG_TRE_Nan_Jt_3_max_time <dttm>, ARG_TRE_Nan_Jt_4_max_time <dttm>, and
#> #   abbreviated variable names ¹​ARG_TRE_Nan_Jt_1_max, ²​ARG_TRE_Nan_Jt_2_max,
#> #   ³​ARG_TRE_Nan_Jt_3_max, ⁴​ARG_TRE_Nan_Jt_4_max, ⁵​ARG_TRE_Nan_Jt_1_max_time

Sub-daily aggregations

sfn_metrics also allows performing sub-daily aggregations by means of the period parameter. Sapfluxnet datasets usually have sub-daily data in the range of 30 minutes to 2 hours, which means that data can be aggregated in periods above 2 hours. We can aggregate to a 3-hour period easily:

custom_funs <- list(max = ~ max(., na.rm = TRUE))

three_hours_agg <- sfn_metrics(
  ARG_TRE,
  period = '3 hours',
  .funs = custom_funs,
  solar = TRUE,
  interval = 'general'
)
#> [1] "Crunching data for ARG_TRE. In large datasets this could take a while"
#> [1] "General data for ARG_TRE"

three_hours_agg[['sapf']]
#> # A tibble: 105 × 5
#>    TIMESTAMP           ARG_TRE_Nan_Jt_1_max ARG_TRE_Nan_Jt_2_max ARG_T…¹ ARG_T…²
#>    <dttm>                             <dbl>                <dbl>   <dbl>   <dbl>
#>  1 2009-11-17 21:00:00                 322.                 190.    313.    293.
#>  2 2009-11-18 00:00:00                 301.                 178.    331.    309.
#>  3 2009-11-18 03:00:00                 343.                 198.    301.    285.
#>  4 2009-11-18 06:00:00                 504.                 386.    406.    428.
#>  5 2009-11-18 09:00:00                 698.                 715.    642.    647.
#>  6 2009-11-18 12:00:00                 778.                 617.    679.    948.
#>  7 2009-11-18 15:00:00                 724.                 531.    603.    624.
#>  8 2009-11-18 18:00:00                 660.                 514.    517.    693.
#>  9 2009-11-18 21:00:00                 384.                 261.    348.    403.
#> 10 2009-11-19 00:00:00                 403.                 339.    313.    396.
#> # … with 95 more rows, and abbreviated variable names ¹​ARG_TRE_Nan_Jt_3_max,
#> #   ²​ARG_TRE_Nan_Jt_4_max