---
title: "heatwaveR internal workflow"
author: "AJ Smit"
date: "`r Sys.Date()`"
description: "This vignette outlines the internal workflow structure of the heatwaveR package."
output:
  rmarkdown::html_vignette:
    fig_caption: yes
vignette: >
  %\VignetteIndexEntry{heatwaveR internal workflow}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---
This document outlines the internal workflow of **heatwaveR**.
```{mermaid}
flowchart TD
A(["ts2clm()"]) --> B[/INPUT:
date, temperature/]
B --> C{Checks?}
C --> |Yes| D[/OUTPUT:
date, temperature/]
C --> |No| E([End])
D --> F(["make_whole_fast()"])
G(["make_whole_fast()"]) --> H[/INPUT:
date, temperature/]
H --> I[/OUTPUT:
doy, date, temperature/]
I --> J(["clim_spread()"])
K(["clim_spread()"]) --> L[/INPUT:
doy, date, temperature/]
L ---> M[spread doy as rows
spread year as cols
grow doy by windowHalfWidth]
M --> N[/OUTPUT:
temperature in
doy x year matrix/]
N --> O(["clim_calc_cpp()"])
P(["clim_calc_cpp()"]) --> Q[/INPUT:
temperature in
doy x year matrix/]
Q -->
```
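The `clim_spread()` step in the diagram (day-of-year as rows, year as columns, rows grown by `windowHalfWidth`) can be illustrated with a minimal base R sketch. This is not the package's implementation: the helper name `spread_doy_year()` is hypothetical, and growing the matrix by wrapping rows around the ends of the year is an assumption made for illustration.

```r
# Hypothetical helper: reshape a daily series into a doy x year matrix and
# grow it by windowHalfWidth rows at the top and bottom (wrap-around assumed).
spread_doy_year <- function(doy, year, temp, windowHalfWidth = 5) {
  years <- sort(unique(year))
  mat <- matrix(NA_real_, nrow = 366, ncol = length(years),
                dimnames = list(1:366, years))
  # Place each value in its (doy, year) cell
  mat[cbind(doy, match(year, years))] <- temp
  # Last rows of the year go on top, first rows go at the bottom
  top <- mat[(366 - windowHalfWidth + 1):366, , drop = FALSE]
  bottom <- mat[1:windowHalfWidth, , drop = FALSE]
  rbind(top, mat, bottom)
}

# Toy data: two non-leap years, so day-of-year 60 (Feb 29) is skipped
dates <- seq(as.Date("2001-01-01"), as.Date("2002-12-31"), by = "day")
doy <- as.integer(format(dates, "%j"))
doy[doy >= 60] <- doy[doy >= 60] + 1L
year <- as.integer(format(dates, "%Y"))
temp <- 15 + 5 * sin(2 * pi * doy / 366)

m <- spread_doy_year(doy, year, temp)
dim(m)  # 366 + 2 * windowHalfWidth rows, one column per year
```

With the padding in place, the window around any day-of-year can be taken as a contiguous block of rows, which is presumably what makes the layout convenient for the next step.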
# `ts2clm()`
`ts2clm()` accepts a data frame with a date column (`x = t`) and a temperature column (`y = temp`). Additional function arguments include:
1. `climatologyPeriod` Required. A vector of two dates (see example below): the first is the start of the climatology period and the second is its end. This period (preferably 30 years in length) is used to calculate the seasonal cycle and the extreme value threshold.
2. `maxPadLength` Specifies the maximum number of consecutive days over which to interpolate (pad) missing data (specified as `NA`) in the input temperature time series; i.e., any consecutive block of `NA`s longer than `maxPadLength` will be left as `NA`. The default is `FALSE`. Set as an integer to interpolate. Setting `maxPadLength` to `TRUE` will return an error.
3. `windowHalfWidth` Width of the sliding window about the day-of-year (to one side of the centre day-of-year) used for the pooling of values and the calculation of the climatology and threshold percentile. Default is `5` days, which gives a window width of 11 days centred on the 6th day of the series of 11 days.
4. `pctile` Threshold percentile (%) for the detection of events (marine heatwaves, MHWs). Default is the `90`th percentile. Should the intent be to use these threshold data for marine cold-spells (MCSs), set `pctile = 10` or some other low value.
5. `smoothPercentile` Boolean switch selecting whether to smooth the climatology and threshold percentile time series with a moving average of `smoothPercentileWidth`. Default is `TRUE`.
6. `clmOnly` Choose to calculate and return only the climatologies. The default is `FALSE`.
7. `var` This argument has been introduced to allow the user to choose if the variance of the seasonal signal per `doy` should be calculated. The default of `FALSE` will prevent the calculation, potentially increasing speed of calculations on gridded data and reducing the size of the output. The variance was initially introduced as part of the standard output from Hobday et al. (2016), but few researchers use it and so it is generally regarded now as unnecessary.
8. `roundClm` This argument allows the user to choose how many decimal places the `seas` and `thresh` outputs will be rounded to. Default is `4`. To prevent rounding set `roundClm = FALSE`. This argument may only be given numeric values or `FALSE`.
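To make the roles of `windowHalfWidth`, `pctile`, and `smoothPercentile` concrete, here is a minimal base R sketch of pooling values in a window around each day-of-year and then smoothing the result with a circular moving average. The helper names `clim_for_doy()` and `smooth_circular()` are hypothetical, the smoothing width of 31 is only an illustrative value, and `quantile()` with its default settings is an assumption; the package itself performs this work in `clim_calc_cpp()` and may differ in detail.

```r
# Hypothetical helper: pooled mean and percentile for one day-of-year `d`.
# `mat` is a doy x year matrix; the window wraps around the ends of the year.
clim_for_doy <- function(mat, d, windowHalfWidth = 5, pctile = 90) {
  n <- nrow(mat)
  rows <- ((d - windowHalfWidth):(d + windowHalfWidth) - 1) %% n + 1
  pooled <- mat[rows, ]  # all years, 2 * windowHalfWidth + 1 days
  c(seas = mean(pooled, na.rm = TRUE),
    thresh = unname(quantile(pooled, probs = pctile / 100, na.rm = TRUE)))
}

# Hypothetical helper: circular moving average of odd width `width`
smooth_circular <- function(x, width = 31) {
  half <- (width - 1) / 2
  n <- length(x)
  vapply(seq_len(n), function(i) {
    idx <- ((i - half):(i + half) - 1) %% n + 1
    mean(x[idx], na.rm = TRUE)
  }, numeric(1))
}

set.seed(1)
mat <- matrix(rnorm(366 * 30, mean = 15, sd = 2), nrow = 366)  # 30 toy years
clim <- t(vapply(1:366, function(d) clim_for_doy(mat, d), numeric(2)))
thresh_smooth <- smooth_circular(clim[, "thresh"], width = 31)
```

Setting `pctile = 10` in the same sketch would give a low threshold of the kind used for MCSs.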
## Details
The function first checks that `climatologyPeriod` is a vector of two dates (e.g. `c("1982-01-01", "2011-12-31")`). A period of at least 30 years is advised, but the function can handle shorter durations.
1. Currently it does not weight an unequal number of dates per year in cases when the duration of each year is not exactly 365 (or 366) days. **Weighting for unequal number of days per year in situations where the `climatologyPeriod` comprises parts of years must be addressed in an update**.
2. This function supports leap years. Currently this is done by ignoring Feb 29s for the initial calculation of the climatology and threshold. The values for Feb 29 are then linearly interpolated from the values for Feb 28 and Mar 1. **In an update I'd suggest using the temperature data for Feb 29 and not interpolating them**.
3. Should the user be concerned about repeated measurements per day, we suggest that the necessary checks and fixes are implemented prior to feeding the time series to `ts2clm()`.
4. Much of the internal functionality depends on **data.table** because it is fast. **I suggest removing this dependence in favour of C++ code**. Also, **except for output of flat tables as `tibble`s, do not rely on the Tidyverse**.
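The leap-year handling described in the list above (Feb 29 ignored at first, then interpolated from Feb 28 and Mar 1) amounts to the following sketch. The helper name `fill_feb29()` is hypothetical and the toy seasonal cycle is invented for illustration; it assumes a 366-long climatology in which day-of-year 60 is Feb 29.

```r
# Hypothetical helper: fill day-of-year 60 (Feb 29) by linear interpolation
# between day 59 (Feb 28) and day 61 (Mar 1).
fill_feb29 <- function(clim) {
  stopifnot(length(clim) == 366)
  # The midpoint of two neighbours is their linear interpolant
  clim[60] <- (clim[59] + clim[61]) / 2
  clim
}

seas <- 15 + 5 * sin(2 * pi * (1:366) / 366)  # toy seasonal cycle
seas[60] <- NA                                # Feb 29 left out initially
seas_filled <- fill_feb29(seas)
```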
## Value
The function will return a `tibble` (see the **tidyverse** package) with the input time series and the newly calculated climatology. The climatology contains the daily climatology and the threshold for calculating MHWs. The software was designed for creating climatologies of daily temperatures, and the units specified below reflect that intended purpose. However, various other kinds of climatologies may be created, and if that is the case, the appropriate units need to be determined by the user.
| Value | Description |
|---------------------|---------------------------------------------------|
| `doy` | Julian day (day-of-year). For non-leap years it runs 1...59 and 61...366, while leap years run 1...366. |
| `t` | The date vector in the original time series supplied in `data`. If an alternate column was provided to the `x` argument, that name will rather be used for this column. |
| `temp` | The measurement vector as per the original `data` supplied to the function. If a different column was given to the `y` argument, that name will be shown here. |
| `seas` | Climatological seasonal cycle \[deg. C\]. |
| `thresh` | Seasonally varying threshold (e.g., 90th percentile) \[deg. C\]. This is used in `detect_event` for the detection/calculation of events (MHWs). |
| `var` | Seasonally varying variance (standard deviation) \[deg. C\]. This column is not returned if `var = FALSE` (default). |
Should `clmOnly` be enabled, only the 365 or 366 day climatology will be returned.
## Internal functions
### `make_whole_fast()`
This function constructs a continuous, uninterrupted time series of temperatures. It takes a series of dates and temperatures, and if irregular (but ordered), inserts missing dates and fills corresponding temperatures with `NA`s. It has only one argument and is fed data in a consistent format by early steps in `ts2clm()`:
1. `data` A data frame with columns for date (`ts_x`) and temperature (`ts_y`) data. Ordered daily data are expected, and although missing values (`NA`) can be accommodated, the function is only recommended when `NA`s occur infrequently, preferably at no more than three consecutive days.
#### Details
1. This function reads in daily data with the time vector specified as `Date` (e.g. "1982-01-01").
2. It is up to the user to calculate daily data from sub-daily measurements. Leap years are automatically accommodated by this function. **In a future update we need to be able to accommodate time series at a range of frequencies from sub-daily to monthly**.
3. This function can handle some missing days, but this is not a licence to actually use these data for the detection of anomalous thermal events. Hobday et al. (2016) recommend gaps of no more than 3 days, which may be adjusted by setting the `maxPadLength` argument of the `ts2clm` function. The longer and more frequent the gaps become the lower the fidelity of the annual climatology and threshold that can be calculated, which will not only have repercussions for the accuracy at which the event metrics can be determined, but also for the number of events that can be detected. **Currently there is no check for the number of `NA`s in the time series provided to `ts2clm()` and this can be added to future updates such that it fails (or sends a loud warning) when a threshold of maximum allowable `NA`s is exceeded**.
4. In this function we only set up the day-of-year (`doy`) vector and insert rows in cases when the original data set has missing rows for some dates. Should the user be concerned about the potential for repeated measurements or worry that the time series is unordered, we suggest that the necessary checks and fixes are implemented prior to feeding the time series to `ts2clm()` via `make_whole_fast()`. When using the fast algorithm, we assume that the user has done all the necessary work to ensure that the time vector is ordered and without repeated measurements beforehand.
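The gap-filling behaviour controlled by `maxPadLength` can be sketched in base R as follows: short runs of `NA` are linearly interpolated, while runs longer than `maxPadLength` are left as `NA`. The helper name `pad_gaps()` is hypothetical, and the use of `approx()` and `rle()` is an assumption for illustration rather than the package's own method.

```r
# Hypothetical helper: interpolate NA runs of length <= maxPadLength only
pad_gaps <- function(temp, maxPadLength = 3) {
  idx <- seq_along(temp)
  # approx() drops NA pairs and interpolates linearly between known points
  filled <- approx(idx, temp, xout = idx)$y
  # Identify NA runs that are too long and restore them to NA
  runs <- rle(is.na(temp))
  too_long <- rep(runs$values & runs$lengths > maxPadLength, runs$lengths)
  filled[too_long] <- NA
  filled
}

temp <- c(10, NA, NA, 13, NA, NA, NA, NA, 18, 19)
pad_gaps(temp, maxPadLength = 3)
# The two-day gap is filled (11, 12); the four-day gap stays NA
```

A check on the total number of `NA`s, as proposed above, could be built on the same `rle()` output.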
#### Value
The function will return a data frame with three columns. The column headed `doy` (day-of-year) is the Julian day running from 1 to 366, but modified so that the day-of-year series for non-leap years runs 1...59 and then 61...366. For leap years the 60th day is February 29. The `ts_x` column is a series of dates of class `Date`, while `ts_y` is the measured variable. This time series will be uninterrupted and continuous daily values between the first and last dates of the input data.
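The behaviour described above can be approximated with a short base R sketch. The helper name `make_whole_sketch()` is hypothetical and the approach (a full date sequence merged with the input) is an assumption for illustration; the real `make_whole_fast()` relies on **data.table** and may differ in detail.

```r
# Hypothetical helper: complete a daily series and add a leap-aware doy
make_whole_sketch <- function(data) {
  full <- data.frame(ts_x = seq(min(data$ts_x), max(data$ts_x), by = "day"))
  out <- merge(full, data, by = "ts_x", all.x = TRUE)  # missing days get NA
  year <- as.integer(format(out$ts_x, "%Y"))
  is_leap <- (year %% 4 == 0 & year %% 100 != 0) | year %% 400 == 0
  doy <- as.integer(format(out$ts_x, "%j"))
  # In non-leap years, shift Mar 1 onwards by one so that day 60 is skipped
  shift <- !is_leap & doy >= 60
  doy[shift] <- doy[shift] + 1L
  data.frame(doy = doy, ts_x = out$ts_x, ts_y = out$ts_y)
}

dat <- data.frame(
  ts_x = as.Date(c("2001-02-27", "2001-02-28", "2001-03-02")),
  ts_y = c(20.1, 20.3, 20.0)
)
res <- make_whole_sketch(dat)
res
```

In this toy input, 2001-03-01 is missing, so the result gains a row for that date with `ts_y` set to `NA`, and the `doy` values step from 59 to 61 because 2001 is not a leap year.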