--- title: "Regular Spacings and Seasonality" output: rmarkdown::html_vignette: css: custom.css code_folding: hide toc: yes vignette: > %\VignetteIndexEntry{Regular Spacings and Seasonality} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ```{r setup, warning=FALSE,message=FALSE} library(tsissm) library(xts) library(data.table) library(tsaux) library(knitr) library(zoo) ``` ## Introduction A recent blog [post](https://research.google/blog/autobnn-probabilistic-time-series-forecasting-with-compositional-bayesian-neural-networks/) from Google Research introduced a compositional Bayesian neural network model and showcased it using the well-known monthly average $CO_2$ concentrations from the Mauna Loa observatory. While the example is informative, the monthly series poses little challenge from a modeling perspective. In contrast, this vignette explores the daily Mauna Loa $CO_2$ dataset, which introduces substantial complexity due to its irregular sampling — data is not available for every day of the year. This irregularity poses problems for seasonal modeling, as uneven time intervals distort lag relationships and degrade model inference. We demonstrate how the `tsissm` package, which natively supports missing values, can be used to reframe irregular series onto a regular time grid with gaps — improving both seasonal modeling and forecast accuracy. ## Data Investigation We begin by visualizing the daily Mauna Loa $CO_2$ series: ```{r, fig.width=6,fig.height=3} data("maunaloa") opar <- par(mfrow = c(1,1)) par(mar = c(2,2,2,2)) plot(zoo(maunaloa$co2, maunaloa$date), type = "l", xlab = "", ylab = "", main = "Mauna Loa Daily Average CO2") par(opar) ``` This dataset spans more than 20 years of daily averages. There is a clear seasonal pattern and a persistent upward trend. However, a closer look reveals that the number of observations per year is not constant: ```{r, fig.width=6,fig.height=3} d <- data.table(date = maunaloa$date, value = maunaloa$co2) d[, year := year(date)] s <- d[,list(.N), by = "year"] opar <- par(mfrow = c(1,1)) par(mar = c(2,2,2,2)) barplot(s$N, names.arg = s$year, main = "Observations/Year", col = "steelblue") par(opar) ``` This inconsistency in daily coverage is the root of many issues when trying to fit seasonal models. ## Naive Approach (Irregular Time Base) A naive approach would be to directly model the series as-is, using a fixed seasonal frequency. For this example, we hold out the years 2024–2025 for evaluation: ```{r, fig.width=6,fig.height=3} co2 <- xts(maunaloa$co2, maunaloa$date) train <- co2["/2023"] test <- co2["2024/"] spec_naive <- issm_modelspec(train, slope = TRUE, seasonal = TRUE, seasonal_frequency = 320, seasonal_harmonics = 2) # fixed slope spec_naive$parmatrix[parameters == "beta", initial := 0] spec_naive$parmatrix[parameters == "beta", lower := 0] spec_naive$parmatrix[parameters == "beta", estimate := 0] mod_naive <- estimate(spec_naive, scores = FALSE) p_naive <- predict(mod_naive, h = length(test), nsim = 3000, forc_dates = index(test)) opar <- par(mfrow = c(1,1)) par(mar = c(2,2,2,2)) plot(p_naive, n_original = 600, gradient_color = "yellow", interval_color = "orange", main = "CO2 [Naive] Model Prediction", median_width = 1, xlab = "") lines(index(test), as.numeric(test), col = 3) legend("topleft", c("Median Forecast","Actual"), col = c(1,3), lty = 1, bty = "n") par(opar) ``` The resulting forecast exhibits a phase shift in the seasonal component. This is expected — irregular time spacing distorts seasonality, resulting in inaccurate timing of peaks and troughs. ## Correct Approach (Regular Time Base with Missing Values) A better solution is to recast the series on a regular daily time grid, filling in missing observations with NA. Since `tsissm` supports missing values in the response variable $y_t$, this allows us to model the series in calendar time while preserving seasonal structure. In `tsissm`, if y_t is missing, the innovation $\varepsilon_t$ is set to zero — its expected value — effectively treating the model output as a one-step-ahead forecast. These missing values are ignored in the likelihood and handled correctly during state filtering. ```{r} full_dates <- seq(index(co2)[1], tail(index(co2),1), by = "days") co2_full <- xts(rep(as.numeric(NA), length(full_dates)), full_dates) co2_full <- cbind(co2_full, co2, join = "left") co2_full <- co2_full[,-1] ``` This new dataset now spans a regular daily grid. The percentage increase in data points (mostly NA) is: `r paste0(round(100 * (NROW(co2_full)/NROW(co2) - 1),0), "%")` more data points which are all missing values. ## Re-estimating with Regularized Calendar Time ```{r, fig.width=6,fig.height=3} train_full <- co2_full["/2023"] test_full <- co2_full["2024/"] spec <- issm_modelspec(train_full, slope = TRUE, seasonal = TRUE, seasonal_frequency = 366, seasonal_harmonics = 5, distribution = "norm") # fixed slope spec$parmatrix[parameters == "beta", initial := 0] spec$parmatrix[parameters == "beta", lower := 0] spec$parmatrix[parameters == "beta", estimate := 0] mod <- estimate(spec, scores = FALSE) p <- predict(mod, h = length(test_full), nsim = 3000, forc_dates = index(test_full)) opar <- par(mfrow = c(1,1)) par(mar = c(2,2,2,2)) plot(p, n_original = 600, gradient_color = "yellow", interval_color = "orange", main = "CO2 [Correct] Model Prediction", median_width = 1, xlab = "") lines(index(test_full), as.numeric(test_full), col = 3) legend("topleft", c("Median Forecast","Actual"), col = c(1,3), lty = 1, bty = "n") par(opar) ``` The forecast now clearly aligns better with the actual seasonal pattern — the phase error seen earlier has been corrected. ## Evaluation We compare forecast accuracy between the naive and corrected models using MAPE and CRPS. Only non-missing values are considered: ```{r} use <- which(!is.na(as.numeric(test_full))) tab <- data.frame(MAPE = c(mape(as.numeric(test), as.numeric(p_naive$analytic_mean)) * 100, mape(as.numeric(test_full)[use], as.numeric(p$analytic_mean)[use]) * 100), CRPS = c(crps(as.numeric(test), p_naive$distribution), crps(as.numeric(test_full)[use], p$distribution[,use]))) rownames(tab) <- c("Naive","Correct") colnames(tab) <- c("MAPE (%)", "CRPS") tab |> kable(digits = 2) ``` The table shows a clear improvement in both metrics for the correctly specified model — particularly a halving of the MAPE. ## Conclusion Irregular sampling in time series — especially those with seasonal structure — can significantly impair model inference and forecasting. This vignette demonstrated how naive models fitted on irregular time bases can introduce phase errors and bias. By recasting the series onto a regular grid with missing values, and leveraging `tsissm’s` built-in support for such data, we were able to restore seasonal fidelity and greatly improve forecast accuracy. This approach is general and applicable to many real-world time series where regular sampling cannot be guaranteed.