A recent blog post
from Google Research introduced a compositional Bayesian neural network
model and showcased it using the well-known monthly average \(CO_2\) concentrations from the Mauna Loa
observatory. While the example is informative, the monthly series poses
little challenge from a modeling perspective.
In contrast, this vignette explores the daily Mauna Loa \(CO_2\) dataset, which introduces substantial complexity due to its irregular sampling — data is not available for every day of the year. This irregularity poses problems for seasonal modeling, as uneven time intervals distort lag relationships and degrade model inference.
We demonstrate how the tsissm
package, which natively
supports missing values, can be used to reframe irregular series onto a
regular time grid with gaps — improving both seasonal modeling and
forecast accuracy.
We begin by visualizing the daily Mauna Loa \(CO_2\) series:
This dataset spans more than 20 years of daily averages. There is a clear seasonal pattern and a persistent upward trend. However, a closer look reveals that the number of observations per year is not constant:
This inconsistency in daily coverage is the root of many issues when trying to fit seasonal models.
A naive approach would be to directly model the series as-is, using a fixed seasonal frequency. For this example, we hold out the years 2024–2025 for evaluation:
co2 <- xts(maunaloa$co2, maunaloa$date)
train <- co2["/2023"]
test <- co2["2024/"]
spec_naive <- issm_modelspec(train, slope = TRUE, seasonal = TRUE, seasonal_frequency = 320, seasonal_harmonics = 2)
# fixed slope
spec_naive$parmatrix[parameters == "beta", initial := 0]
spec_naive$parmatrix[parameters == "beta", lower := 0]
spec_naive$parmatrix[parameters == "beta", estimate := 0]
mod_naive <- estimate(spec_naive, scores = FALSE)
p_naive <- predict(mod_naive, h = length(test), nsim = 3000, forc_dates = index(test))
opar <- par(mfrow = c(1,1))
par(mar = c(2,2,2,2))
plot(p_naive, n_original = 600, gradient_color = "yellow", interval_color = "orange", main = "CO2 [Naive] Model Prediction", median_width = 1, xlab = "")
lines(index(test), as.numeric(test), col = 3)
legend("topleft", c("Median Forecast","Actual"), col = c(1,3), lty = 1, bty = "n")
The resulting forecast exhibits a phase shift in the seasonal component. This is expected — irregular time spacing distorts seasonality, resulting in inaccurate timing of peaks and troughs.
A better solution is to recast the series on a regular daily time
grid, filling in missing observations with NA. Since tsissm
supports missing values in the response variable \(y_t\), this allows us to model the series
in calendar time while preserving seasonal structure.
In tsissm
, if y_t is missing, the innovation \(\varepsilon_t\) is set to zero — its
expected value — effectively treating the model output as a
one-step-ahead forecast. These missing values are ignored in the
likelihood and handled correctly during state filtering.
This new dataset now spans a regular daily grid. The percentage increase in data points (mostly NA) is: 19% more data points which are all missing values.
train_full <- co2_full["/2023"]
test_full <- co2_full["2024/"]
spec <- issm_modelspec(train_full, slope = TRUE, seasonal = TRUE, seasonal_frequency = 366, seasonal_harmonics = 5, distribution = "norm")
# fixed slope
spec$parmatrix[parameters == "beta", initial := 0]
spec$parmatrix[parameters == "beta", lower := 0]
spec$parmatrix[parameters == "beta", estimate := 0]
mod <- estimate(spec, scores = FALSE)
p <- predict(mod, h = length(test_full), nsim = 3000, forc_dates = index(test_full))
opar <- par(mfrow = c(1,1))
par(mar = c(2,2,2,2))
plot(p, n_original = 600, gradient_color = "yellow", interval_color = "orange", main = "CO2 [Correct] Model Prediction", median_width = 1, xlab = "")
lines(index(test_full), as.numeric(test_full), col = 3)
legend("topleft", c("Median Forecast","Actual"), col = c(1,3), lty = 1, bty = "n")
The forecast now clearly aligns better with the actual seasonal pattern — the phase error seen earlier has been corrected.
We compare forecast accuracy between the naive and corrected models using MAPE and CRPS. Only non-missing values are considered:
use <- which(!is.na(as.numeric(test_full)))
tab <- data.frame(MAPE = c(mape(as.numeric(test), as.numeric(p_naive$analytic_mean)) * 100,
mape(as.numeric(test_full)[use], as.numeric(p$analytic_mean)[use]) * 100),
CRPS = c(crps(as.numeric(test), p_naive$distribution), crps(as.numeric(test_full)[use], p$distribution[,use])))
rownames(tab) <- c("Naive","Correct")
colnames(tab) <- c("MAPE (%)", "CRPS")
tab |> kable(digits = 2)
MAPE (%) | CRPS | |
---|---|---|
Naive | 0.55 | 1.57 |
Correct | 0.26 | 0.78 |
The table shows a clear improvement in both metrics for the correctly specified model — particularly a halving of the MAPE.
Irregular sampling in time series — especially those with seasonal
structure — can significantly impair model inference and forecasting.
This vignette demonstrated how naive models fitted on irregular time
bases can introduce phase errors and bias. By recasting the series onto
a regular grid with missing values, and leveraging tsissm’s
built-in support for such data, we were able to restore seasonal
fidelity and greatly improve forecast accuracy.
This approach is general and applicable to many real-world time series where regular sampling cannot be guaranteed.