Regular Spacings and Seasonality

Introduction

A recent blog post
from Google Research introduced a compositional Bayesian neural network model and showcased it using the well-known monthly average \(CO_2\) concentrations from the Mauna Loa observatory. While the example is informative, the monthly series poses little challenge from a modeling perspective.

In contrast, this vignette explores the daily Mauna Loa \(CO_2\) dataset, which introduces substantial complexity due to its irregular sampling — data is not available for every day of the year. This irregularity poses problems for seasonal modeling, as uneven time intervals distort lag relationships and degrade model inference.

We demonstrate how the tsissm package, which natively supports missing values, can be used to reframe irregular series onto a regular time grid with gaps — improving both seasonal modeling and forecast accuracy.

Data Investigation

We begin by visualizing the daily Mauna Loa \(CO_2\) series:

Code

data("maunaloa")
opar <- par(mfrow = c(1,1))
par(mar = c(2,2,2,2))
plot(zoo(maunaloa$co2, maunaloa$date), type = "l", xlab = "", ylab = "", main = "Mauna Loa Daily Average CO2")

Code

par(opar)

This dataset spans more than 20 years of daily averages. There is a clear seasonal pattern and a persistent upward trend. However, a closer look reveals that the number of observations per year is not constant:

Code

d <- data.table(date = maunaloa$date, value = maunaloa$co2)
d[, year := year(date)]
s <- d[,list(.N), by = "year"]
opar <- par(mfrow = c(1,1))
par(mar = c(2,2,2,2))
barplot(s$N, names.arg = s$year, main = "Observations/Year", col = "steelblue")

Code

par(opar)

This inconsistency in daily coverage is the root of many issues when trying to fit seasonal models.

Naive Approach (Irregular Time Base)

A naive approach would be to directly model the series as-is, using a fixed seasonal frequency. For this example, we hold out the years 2024–2025 for evaluation:

Code

co2 <- xts(maunaloa$co2, maunaloa$date)
train <- co2["/2023"]
test <- co2["2024/"]
spec_naive <- issm_modelspec(train, slope = TRUE, seasonal = TRUE, seasonal_frequency = 320, seasonal_harmonics = 2)
# fixed slope
spec_naive$parmatrix[parameters == "beta", initial := 0]
spec_naive$parmatrix[parameters == "beta", lower := 0]
spec_naive$parmatrix[parameters == "beta", estimate := 0]
mod_naive <- estimate(spec_naive, scores = FALSE)
p_naive <- predict(mod_naive, h = length(test), nsim = 3000, forc_dates = index(test))
opar <- par(mfrow = c(1,1))
par(mar = c(2,2,2,2))
plot(p_naive, n_original = 600, gradient_color = "yellow", interval_color = "orange", main = "CO2 [Naive] Model Prediction", median_width = 1, xlab = "")
lines(index(test), as.numeric(test), col = 3)
legend("topleft", c("Median Forecast","Actual"), col = c(1,3), lty = 1, bty = "n")

Code

par(opar)

The resulting forecast exhibits a phase shift in the seasonal component. This is expected — irregular time spacing distorts seasonality, resulting in inaccurate timing of peaks and troughs.

Correct Approach (Regular Time Base with Missing Values)

A better solution is to recast the series on a regular daily time grid, filling in missing observations with NA. Since tsissm supports missing values in the response variable \(y_t\), this allows us to model the series in calendar time while preserving seasonal structure.

In tsissm, if y_t is missing, the innovation \(\varepsilon_t\) is set to zero — its expected value — effectively treating the model output as a one-step-ahead forecast. These missing values are ignored in the likelihood and handled correctly during state filtering.

Code

full_dates <- seq(index(co2)[1], tail(index(co2),1), by = "days")
co2_full <- xts(rep(as.numeric(NA), length(full_dates)), full_dates)
co2_full <- cbind(co2_full, co2, join = "left")
co2_full <- co2_full[,-1]

This new dataset now spans a regular daily grid. The percentage increase in data points (mostly NA) is: 19% more data points which are all missing values.

Re-estimating with Regularized Calendar Time

Code

train_full <- co2_full["/2023"]
test_full <- co2_full["2024/"]
spec <- issm_modelspec(train_full, slope = TRUE, seasonal = TRUE, seasonal_frequency = 366, seasonal_harmonics = 5, distribution = "norm")
# fixed slope
spec$parmatrix[parameters == "beta", initial := 0]
spec$parmatrix[parameters == "beta", lower := 0]
spec$parmatrix[parameters == "beta", estimate := 0]
mod <- estimate(spec, scores = FALSE)
#> Warning in eval_f(x, ...): NaN or Infinite value in likelihood
#> Warning in eval_f(x, ...): NaN or Infinite value in likelihood
p <- predict(mod, h = length(test_full), nsim = 3000, forc_dates = index(test_full))
opar <- par(mfrow = c(1,1))
par(mar = c(2,2,2,2))
plot(p, n_original = 600, gradient_color = "yellow", interval_color = "orange", main = "CO2 [Correct] Model Prediction", median_width = 1, xlab = "")
lines(index(test_full), as.numeric(test_full), col = 3)
legend("topleft", c("Median Forecast","Actual"), col = c(1,3), lty = 1, bty = "n")

Code

par(opar)

The forecast now clearly aligns better with the actual seasonal pattern — the phase error seen earlier has been corrected.

Evaluation

We compare forecast accuracy between the naive and corrected models using MAPE and CRPS. Only non-missing values are considered:

Code

use <- which(!is.na(as.numeric(test_full)))
tab <- data.frame(MAPE = c(mape(as.numeric(test), as.numeric(p_naive$analytic_mean)) * 100, 
                           mape(as.numeric(test_full)[use], as.numeric(p$analytic_mean)[use]) * 100),
                  CRPS = c(crps(as.numeric(test), p_naive$distribution), crps(as.numeric(test_full)[use], p$distribution[,use])))
rownames(tab) <- c("Naive","Correct")
colnames(tab) <- c("MAPE (%)", "CRPS")
tab |> kable(digits = 2)

	MAPE (%)	CRPS
Naive	0.55	1.57
Correct	0.26	0.78

The table shows a clear improvement in both metrics for the correctly specified model — particularly a halving of the MAPE.

Conclusion

Irregular sampling in time series — especially those with seasonal structure — can significantly impair model inference and forecasting. This vignette demonstrated how naive models fitted on irregular time bases can introduce phase errors and bias. By recasting the series onto a regular grid with missing values, and leveraging tsissm’s built-in support for such data, we were able to restore seasonal fidelity and greatly improve forecast accuracy.

This approach is general and applicable to many real-world time series where regular sampling cannot be guaranteed.