--- title: "TSrepr: Time series representations in a use case" author: "Peter Laurinec" date: "`r Sys.Date()`" bibliography: papers.bib url: https://petolau.github.io/package/ output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Use case: clustering time series representations} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- In this vignette (tutorial), I will show you one use case to use time series representations effectively. This use case is **clustering** of time series and it will be clustering of consumers of electricity load. ## Clustering of time series representations By clustering of consumers of electricity load, we can extract typical load **profiles**, improve the accuracy of consequent electricity consumption forecasting, detect anomalies or monitor a whole smart grid (grid of consumers) (@Lau1, @Lau2, @Laurinec2018). I will show you the first use case, the extraction of typical electricity load profiles by [**K-means**](https://en.wikipedia.org/wiki/K-means_clustering) clustering method. Firstly, let's load the required packages, data (`elec_load`) and return the dimensions of the dataset. ```{r, message=F} library(TSrepr) library(ggplot2) library(data.table) data("elec_load") dim(elec_load) ``` There are 50 time series of length 672, specifically time series of length 2 weeks. It is obvious that dimensionality is too high and the [**curse of dimensionality**](https://en.wikipedia.org/wiki/Curse_of_dimensionality) can happen. For this reason, we have to reduce dimensionality in some way. One of the best approaches is to use time series representations in order to reduce dimensionality, reduce noise and emphasize the main characteristics of time series. For double seasonal time series of electricity consumption (daily and weekly seasonality), the model-based representation approaches seem to have best ability to extract typical profiles. However, I will use some other representations too. Let's use one of the basic model-based representation methods - **mean seasonal profile**. It is implemented in the `repr_seas_profile` function and we will use it alongside `repr_matrix` function that computes representations for every row of a matrix of time series. One more very important notice here, normalisation of time series is a necessary procedure before every clustering or classification of time series. It is due to a fact that we want to extract typical curves of consumption and don't cluster based on an amount of consumption. By using the **TSrepr** package, we can do it all in one function - `repr_matrix`. We will use z-score normalisation implemented in `norm_z` function. ```{r} data_seasprof <- repr_matrix(elec_load, func = repr_seas_profile, args = list(freq = 48, func = mean), normalise = TRUE, func_norm = norm_z) dim(data_seasprof) ``` We can see that dimensionality was in fact reduced significantly. Now, let's use the K-means (`kmeans` function) clustering method to extract typical consumption profiles. I will set the number of clusters to 5. ```{r} res_km <- kmeans(data_seasprof, 5, nstart = 10) ``` Let's plot the results of clustering. ```{r, warning=F, message=F, fig.height=5.5, fig.width=7} # prepare data for plotting data_plot <- melt(data.table(ID = 1:nrow(data_seasprof), class = res_km$cluster, data_seasprof), id.vars = c("ID", "class"), variable.name = "Time", variable.factor = FALSE ) data_plot[, Time := as.integer(gsub("V", "", Time))] # prepare centroids centers <- melt(data.table(ID = 1:nrow(res_km$centers), class = 1:nrow(res_km$centers), res_km$centers), id.vars = c("ID", "class"), variable.name = "Time", variable.factor = FALSE ) centers[, Time := as.integer(gsub("V", "", Time))] # plot the results ggplot(data_plot, aes(Time, value, group = ID)) + facet_wrap(~class, ncol = 2, scales = "free_y") + geom_line(color = "grey10", alpha = 0.65) + geom_line(data = centers, aes(Time, value), color = "firebrick1", alpha = 0.80, size = 1.2) + labs(x = "Time", y = "Load (normalised)") + theme_bw() ``` We can see 4 typical extracted profiles - red lines (centroids of clusters). Now, let's try some more sophisticated method for the extraction of seasonal profiles - **GAM regression coefficients**. By `repr_gam` function, we can extract double seasonal regression coefficients - daily and weekly. However, the number of weekly seasonal regression coefficients will be only 6, because the number of the second seasonality coefficients is set to freq_2 / freq_1, so 48*7 / 48 = 7, and 7 - 1 = 6 because of splines computation (differentiation). I will again use `repr_matrix` function. ```{r} data_gam <- repr_matrix(elec_load, func = repr_gam, args = list(freq = c(48, 48*7)), normalise = TRUE, func_norm = norm_z) dim(data_gam) ``` So the dimension is 47 + 6 = 53 because of usage of splines in GAM method. Let's cluster the data and visualize it. ```{r} res_km <- kmeans(data_gam, 5, nstart = 10) ``` ```{r, warning=F, message=F, fig.height=5.5, fig.width=7} # prepare data for plotting data_plot <- melt(data.table(ID = 1:nrow(data_gam), class = res_km$cluster, data_gam), id.vars = c("ID", "class"), variable.name = "Time", variable.factor = FALSE ) data_plot[, Time := as.integer(gsub("V", "", Time))] # prepare centroids centers <- melt(data.table(ID = 1:nrow(res_km$centers), class = 1:nrow(res_km$centers), res_km$centers), id.vars = c("ID", "class"), variable.name = "Time", variable.factor = FALSE ) centers[, Time := as.integer(gsub("V", "", Time))] # plot the results ggplot(data_plot, aes(Time, value, group = ID)) + facet_wrap(~class, ncol = 2, scales = "free_y") + geom_line(color = "grey10", alpha = 0.65) + geom_line(data = centers, aes(Time, value), color = "firebrick1", alpha = 0.80, size = 1.2) + geom_vline(xintercept = 46, color = "dodgerblue2", size = 1.4, linetype = 5, alpha = 0.8) + labs(x = "Time", y = "Load (normalised)") + theme_bw() ``` That is a more delightful result, isn't it? Extracted consumption profiles are smoother than in the case of average seasonal profiles. The blue dashed line borders daily and weekly seasonal coefficients. I will show you also result of clustering of some nondata adaptive representation, let's pick for example **DFT** method (@fft) and extract first 48 DFT coefficients. ```{r} data_dft <- repr_matrix(elec_load, func = repr_dft, args = list(coef = 48), normalise = TRUE, func_norm = norm_z) dim(data_dft) ``` ```{r} res_km <- kmeans(data_dft, 5, nstart = 10) ``` ```{r, warning=F, message=F, fig.height=5.5, fig.width=7} # prepare data for plotting data_plot <- melt(data.table(ID = 1:nrow(data_dft), class = res_km$cluster, data_dft), id.vars = c("ID", "class"), variable.name = "Time", variable.factor = FALSE ) data_plot[, Time := as.integer(gsub("V", "", Time))] # prepare centroids centers <- melt(data.table(ID = 1:nrow(res_km$centers), class = 1:nrow(res_km$centers), res_km$centers), id.vars = c("ID", "class"), variable.name = "Time", variable.factor = FALSE ) centers[, Time := as.integer(gsub("V", "", Time))] # plot the results ggplot(data_plot, aes(Time, value, group = ID)) + facet_wrap(~class, ncol = 2, scales = "free_y") + geom_line(color = "grey10", alpha = 0.65) + geom_line(data = centers, aes(Time, value), color = "firebrick1", alpha = 0.80, size = 1.2) + labs(x = "Time", y = "Load (normalised)") + theme_bw() ``` The interpretability of these results would be difficult. Therefore, model-based time series representations are very effective in this use case. I will show you the usage of one more representation method - **FeaClip**. The **FeaClip** is feature extraction method from a clipping representation (@Laurinec2018). The windowing approach alongside FeaClip is recommended to use for every day of time series. The big advantage is that normalisation is not needed alongside of FeaClip method. Let's use it in our case. ```{r} data_feaclip <- repr_matrix(elec_load, func = repr_feaclip, windowing = TRUE, win_size = 48) dim(data_feaclip) ``` ```{r} res_km <- kmeans(data_feaclip, 5, nstart = 10) ``` ```{r, warning=F, message=F, fig.height=5.5, fig.width=7} # prepare data for plotting data_plot <- melt(data.table(ID = 1:nrow(data_feaclip), class = res_km$cluster, data_feaclip), id.vars = c("ID", "class"), variable.name = "Time", variable.factor = FALSE ) data_plot[, Time := as.integer(gsub("V", "", Time))] # prepare centroids centers <- melt(data.table(ID = 1:nrow(res_km$centers), class = 1:nrow(res_km$centers), res_km$centers), id.vars = c("ID", "class"), variable.name = "Time", variable.factor = FALSE ) centers[, Time := as.integer(gsub("V", "", Time))] # plot the results ggplot(data_plot, aes(Time, value, group = ID)) + facet_wrap(~class, ncol = 2, scales = "free_y") + geom_line(color = "grey10", alpha = 0.65) + geom_line(data = centers, aes(Time, value), color = "firebrick1", alpha = 0.80, size = 1.2) + labs(x = "Time", y = "Number of occurences") + theme_bw() ``` It seems like good separability of clusters, better than in the case of DFT. In this tutorial, I showed you the usage of time series representation methods to create more accurate characteristic profiles of consumers. ## Bibliography