--- title: "Introduction to reviser" output: rmarkdown::html_vignette bibliography: references.bib biblio-style: apalike link-citations: true vignette: > %\VignetteIndexEntry{Introduction to reviser} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- The `reviser` package provides tools to manipulate and analyze vintages of time series data subject to revisions. This vignette demonstrates how to get started and structure your data according to the package’s conventions. ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ```{r setup, warning=FALSE, message=FALSE} # The packages used in this vignette library(reviser) library(dplyr) library(tsbox) ``` ## Package conventions --- The most conventional way is to represent vintage data in a real-time data matrix, where each **row** represents a time period and each **column** represents successive releases of the data. This known as the **wide format**. The package supports data in **wide format**. and assumes that the data is organized in the following columns: - `time`, the time period - Publication dates in `'yyyy-mm-dd'` format or release numbers as `release_#` While wide format is practical for inspection, data manipulation is often easier in **long (tdy) format**, which consists of: - `time`, the time period - `pub_date` and/or `release`, the publication date or release number - `value`, the reported value. - `id`, an optional column to distinguish between different series For illustration, the package provides a dataset in **long format**, `gdp`. Below, we examine GDP growth rates for the US and Euro Area during the 2007–2009 financial crisis. ```{r} # Example long-format US GDP data data("gdp") gdp_us_short <- gdp |> dplyr::filter(id == "US") |> ts_pc() |> filter( pub_date >= as.Date("2007-01-01"), pub_date < as.Date("2009-01-01"), time >= as.Date("2007-01-01"), time < as.Date("2009-01-01") ) # Example long-format EA GDP data gdp_ea_short <- gdp |> dplyr::filter(id == "EA") |> ts_pc() |> filter( pub_date >= as.Date("2007-01-01"), pub_date < as.Date("2009-01-01"), time >= as.Date("2007-01-01"), time < as.Date("2009-01-01") ) head(gdp_ea_short) ``` ## Convert Long to Wide Format --- To transform a dataset from **long format** to **wide format**, use `vintages_wide()`. The function requires columns `time` and `value`, along with either `pub_date` or `release`. An optional `id` column can be used to distinguish between multiple series. ```{r} # Convert wide-format data to long format wide_ea_short <- vintages_wide(gdp_ea_short) head(wide_ea_short) ``` ## Convert Wide to Long Format --- To revert to **long format**, use `vintages_long()`. The function expects column names in **wide format** to be valid dates or contain the string `"release"`. ```{r} # Convert back to long format long_ea_short <- vintages_long(wide_ea_short) long_ea_short ``` ## Handling Multiple Series with `id` --- If an `id` column is present, `vintages_wide()` returns a **list** with one dataset per unique `id`. Conversely, `vintages_long()` maintains the `id` column to distinguish between series. ```{r} gdp_short <- bind_rows( gdp_ea_short |> mutate(id = "EA"), gdp_us_short |> mutate(id = "US") ) gdp_wide_short <- vintages_wide(gdp_short) head(gdp_wide_short) ``` ## Extracting Releases --- Once data follows the package conventions, it can be analyzed further. A common task is assessing the **first release** of data, which corresponds to the diagonal of the real-time data matrix. Use `get_nth_release()` to extract the **nth** release. This function is **0-indexed**, so the first release corresponds to `n = 0`. ```{r} # Get the first release and check in wide format gdp_releases <- get_nth_release(gdp_short, n = 0) vintages_wide(gdp_releases) # The function uses the pub_date column by default to define columns in wide # format. Specifying the `names_from` argument allows to use the release column. gdp_releases <- get_nth_release(gdp_short, n = 0:1) vintages_wide(gdp_releases, names_from = "release") ``` To assess data accuracy, we need to define the **final release**. Since many statistical agencies continue revising data indefinitely, the **latest release** is often used as a benchmark. Use `get_nth_release(n = "latest")` to extract the most recent vintage. ```{r} # Get the latest release gdp_final <- get_nth_release(gdp_short, n = "latest") vintages_wide(gdp_final) ``` Some agencies **fix** their data after a certain period (e.g., Germany finalizes GDP data in August four years after the initial release). The function `get_fixed_release()` extracts these fixed releases. ```{r} gdp_ea_longer <- gdp |> dplyr::filter(id == "EA") |> ts_pc() |> filter( time >= as.Date("2000-01-01"), time < as.Date("2006-01-01"), pub_date >= as.Date("2000-01-01"), pub_date <= as.Date("2006-01-01") ) # Get the release from October four years after the initial release gdp_releases <- get_fixed_release( gdp_ea_longer, years = 4, month = "October" ) gdp_releases ``` ## Visualizing Vintage Data --- The `reviser` package provides simple and flexible tools for visualizing real-time vintages. The primary function for this is `plot_vintages()`, which supports multiple plot types, including line plots, scatter plots, bar plots, and boxplots. It returns a `ggplot2` object, allowing for further customization with the `ggplot2` package. This is a simple function allowing to visualize real-time data in a few lines of code. However, it is only possible to plot the data along one dimension (either `pub_date`, `release` or `id`). To use `plot_vintages()`, provide a data frame containing: - A `time` column (representing the observation period), - A `value` column (containing the reported data), - A column indicating the publication date (`pub_date`) or release number (`release`), which determines the dimension along which the data is visualized. For example, to visualize how GDP estimates evolved over time, you can create a line plot comparing different vintages: ```{r} # Line plot showing GDP vintages over the publication date dimension plot_vintages( gdp_us_short, title = "Real-time GDP Estimates for the US", subtitle = "Growth Rate in %" ) # Line plot showing GDP vintages over the release dimension gdp_releases <- get_nth_release(gdp_us_short, n = 0:3) plot_vintages(gdp_releases, dim_col = "release") ``` By default, if `dim_col` (the dimension along which vintages are plotted) contains more than 30 unique values, only the most recent 30 are displayed to maintain readability. ```{r} # Line plot showing GDP vintages over the publication date dimension plot_vintages( gdp_ea_longer, type = "boxplot", title = "Real-time GDP Estimates for the Euro Area", subtitle = "Growth Rate in %" ) # Line plot showing GDP vintages over id dimension plot_vintages( gdp |> ts_pc() |> get_latest_release() |> na.omit(), dim_col = "id", title = "Recent GDP Estimates", subtitle = "Growth Rate in %" ) ``` For further customization, you can apply custom themes and color scales using: - `scale_color_reviser()` - `scale_fill_reviser()` - `theme_reviser()` These functions ensure a consistent visual style tailored for vintage data analysis. ## Analyzing Data Revisions and Releases --- After defining the final release, we can analyze revisions and releases in multiple ways: - Calculate revisions: `get_revisions()`. See vignette [Understanding Data Revisions](understanding-revisions.html) for more details. - Analyze the revisions: `get_revision_analysis()`. See vignette [Revision Patterns and Statistics](revision-analysis.html) for more details. - Identify the first efficient release: `get_first_efficient_release()`. See vignette [Efficient Release Identification](efficient-release.html) for more details. - Nowcast future revisions: `kk_nowcast()` and `jvn_nowcast()`. See vignettes [Nowcasting revisions using the generalized Kishor-Koenig family](nowcasting-revisions-kk.html) and [Nowcasting revisions using the Jacobs-Van Norden model](nowcasting-revisions-jvn.html) for more details.