Introduction to danstat

Introduction

This vignette introduces the danstat package by means of a simple use case to illustrate the data discovery process. The package is designed to give easy and intuitive access to the Danmarks Statistik (Statistics Denmark) Statistikbank API which allows access to all published data in StatBank. As a public institution, Statistics Denmark provides open access to the data, so the API doesn’t require authentication and has no rate limits. The data is free to re-use and reproduce even for commercial use, however, source accreditation to Statistics Denmark should be given. See terms for the Use and reuse of data.

library(danstat)
library(purrr)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(ggplot2)
library(kableExtra)
#> 
#> Attaching package: 'kableExtra'
#> The following object is masked from 'package:dplyr':
#> 
#>     group_rows

Exploring a subject

In this example we would like to examine alcohol related incidents and the consumption of alcohol in Denmark. An inital call to get_subjects() returns the following data frame:

get_subjects()
#>    id            description active hasSubjects subjects
#> 1   1                 People   TRUE        TRUE     NULL
#> 2   2      Labour and income   TRUE        TRUE     NULL
#> 3   3                Economy   TRUE        TRUE     NULL
#> 4   4      Social conditions   TRUE        TRUE     NULL
#> 5   5 Education and research   TRUE        TRUE     NULL
#> 6   6               Business   TRUE        TRUE     NULL
#> 7   7              Transport   TRUE        TRUE     NULL
#> 8   8    Culture and leisure   TRUE        TRUE     NULL
#> 9   9 Environment and energy   TRUE        TRUE     NULL
#> 10 19                  Other   TRUE        TRUE     NULL

The subjects of interest for us are “Business” and “Transport”. We look into more details into these to subjects:

subj <- get_subjects(subjects = c("6","7"))
subsubjects <- subj$subjects %>% bind_rows()
subsubjects
#>      id                            description active hasSubjects subjects
#> 1  3456                  Structure of business   TRUE        TRUE     NULL
#> 2  3457                       Business economy   TRUE        TRUE     NULL
#> 3  3458              International enterprises   TRUE        TRUE     NULL
#> 4  3460 Agriculture, horticulture and forestry   TRUE        TRUE     NULL
#> 5  3461                Fishery and aquaculture   TRUE        TRUE     NULL
#> 6  3462               Manufacturing industries   TRUE        TRUE     NULL
#> 7  3463                           Construction   TRUE        TRUE     NULL
#> 8  3465                    Distributive trades   TRUE        TRUE     NULL
#> 9  3466               Accommodation and travel   TRUE        TRUE     NULL
#> 10 3450                       Financial sector   TRUE        TRUE     NULL
#> 11 3467                         Service sector   TRUE        TRUE     NULL
#> 12 3459              Business tendency surveys   TRUE        TRUE     NULL
#> 13 3440                     Means of transport   TRUE        TRUE     NULL
#> 14 3469                         Infrastructure   TRUE        TRUE     NULL
#> 15 3115                    Passenger transport   TRUE        TRUE     NULL
#> 16 3464                        Goods transport   TRUE        TRUE     NULL
#> 17 3413                      Traffic accidents   TRUE        TRUE     NULL

Getting table information and variables

Now we would like to see what data tables are available for subjects “Distributive trades” and “Traffic accidents”:

tables <- get_tables(subjects = c("3465", "3413")) 
tables %>% 
  select(id, text, variables) %>% 
  kable()
id text variables
DETA151 Retail Trade Index industry (DB07), time
DETA152 Retail Trade Index commodity group, index type , time
GTIN Retail of groceries calculated on the basis of transaction data commodity group, unit , time
ALKO4 Consumption and sales of alcohol and tobacco, subject to excises duties type, time
ALKO3 Consumption and sales of alcohol, subject to excises duties type, time
ALKO2 Consumption and sales of alcohol and tobacco, subject to excises duties, by pop. type, time
UHELD3 Road traffic accidents type of accident , accident situation, urban area , speed limit , time
UHELD4 Road traffic accidents type of accident , means of transport, hour , day of the week , month , time
UHELD5 Road traffic accidents type of accident , accident situation , means of transport , other units involved, time
UHELD6 Road traffic accidents type of accident , accident situation , means of transport , number of transport units involved, time
UHELDK7 Road traffic accidents type of accident , region , urban area , accident situation, time
UHELD8 Injured and killed in road traffic accidents type of accident , casualty , means of transport, sex , age , type of injury , time
UHELD9 Injured and killed in road traffic accidents type of accident , casualty , means of transport, type of person , urban area , type of road , time
UHELD10 Injured and killed in road traffic accidents type of accident , accident situation , casualty , means of transport , other units involved, time
UHELD11 Injured and killed in road traffic accidents type of accident , casualty , means of transport , position , type of person , influenced by alcohol, time
UHELDK1 Injured and killed in road traffic accidents region , casualty , motor vehicles involved, age , sex , time
UHELDK2 Injured and killed in alcohol accidents region , casualty, age , sex , time
UHELD12 Drivers and pedestrians means of transport , sex , age , age of driving license , level of alcohol in blood, time
MOERKE Injured in road traffic accidents reported by the police and casualty wards reporter , accident situation, means of transport, sex , age , type of injury , time
MOERKE1 Injured in road traffic accidents reported by casualty wards reporter , accident situation , means of transport , sex , age , casualty (classified by diagnosis), time
BANE91 Fatalities and injuries in railway traffic accidents railway system , category of person, casualty , time
BANE92 Fatalities and injuries in railway traffic accidents railway system , type of accident, casualty , time
SKIB95 Accidents at sea involving danish vessels type of accident, type of vessel , scope , time
SKIB96 Accidents at sea in Danish sea territory type of accident , type of vessel , scope , land of registration, time

If we would like to examine number of traffic accidents by hour of day (which is an interesting question in itself) we can choose table “UHELD4”. To extract sales and consumption of alcohol, we will use table “ALKO3”. We can now look into the variable codes and values for these 2 tables and determine if we should filter any values.

vars_acc <- get_table_metadata(table_id = "uheld4", variables_only = TRUE)
vars_alco <- get_table_metadata(table_id = "alko3", variables_only = TRUE)

vars_acc %>% 
  select(id, text)
#>         id               text
#> 1   UHELDA   type of accident
#> 2 TRANSMID means of transport
#> 3     KLOK               hour
#> 4      UGE    day of the week
#> 5      MND              month
#> 6      Tid               time

vars_alco %>% 
  select(id, text)
#>     id text
#> 1 TYPE type
#> 2  Tid time

Let us look a bit into the values for the “UHELDA” (type of accident) variable from table “UHELD4” and the “TYPE” variable from table “ALKO3” to determine what the possible values of these are.

vars_acc$values[1] %>% 
  kable()
id text
0000 All accidents
1000 Accidents involving persons influenced by alcohol
2000 Others accidents

vars_alco$values[1] %>% 
  kable()
id text
09 Sales of lager equivalents, total excl. tax free beer (index 2000=100) (2000-)
010 Sales of wine, total (indeks 2000=100)
011 Sales of spirits, total (indeks 2000=100)
0111 Sales of alcopops, total (indeks 2006=100)(2006-)
055 Consumption of lager equivalents, total, excl. tax free beer (index 2000=100) (alc. 4,6 vol.) (2000-
060 Consumption of wine, total (index 2000=100) (2000-)
065 Consumption of spirits, total (index 2000=100) (2000-)
070 Consumption of alcopops, total (index 2006=100) (2006-)

To compare the diurnal patterns of alcohol-related accidents vs. other accidents we will select values c(1000, 2000) from the “UHELDA” variable and values c("09", "055") (sales and consumption of lager equivalents) from the “TYPE” variable. As we would like data for all time periods (“Tid”) we will leave the values for that variable as NA.

Pulling data from the API

We can now pull the needed data from the API - first the accident numbers:

variable_codes <- vars_acc$id[c(1, 3, 6)] # UHELDA, KLOK and Tid
variable_values <- list(c(1000, 2000), NA, NA) # all values for KLOK and Tid

# Construct the variable_input as a list of code-values pairs
variable_input <- purrr::map2(.x = variable_codes, .y = variable_values, .f = ~list(code = .x, values = .y))

# Get data 
accidents <- get_data("uheld4", variables = variable_input)
#> Rows: 1200 Columns: 4
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ";"
#> chr (2): UHELDA, KLOK
#> dbl (2): TID, INDHOLD
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(accidents) %>% kable()
UHELDA KLOK TID INDHOLD
Accidents involving persons influenced by alcohol 00-01 1997 94
Accidents involving persons influenced by alcohol 00-01 1998 85
Accidents involving persons influenced by alcohol 00-01 1999 81
Accidents involving persons influenced by alcohol 00-01 2000 87
Accidents involving persons influenced by alcohol 00-01 2001 77
Accidents involving persons influenced by alcohol 00-01 2002 91

and then the alcohol sales and consumption values:

variable_codes <- vars_alco$id 
variable_values <- list(c("055", "09"), NA) # All values for Tid

# Construct the variable_input as a list of code-values pairs
variable_input <- purrr::map2(.x = variable_codes, .y = variable_values, .f = ~list(code = .x, values = .y))

# Get data 
alcohol <- get_data("alko3", variables = variable_input)
#> Rows: 132 Columns: 3
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ";"
#> chr (2): TYPE, INDHOLD
#> dbl (1): TID
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
alcohol %>% 
  filter(INDHOLD != "..") %>% # the API returns ".." as missing values
  head() %>% 
  kable()
TYPE TID INDHOLD
Consumption of lager equivalents, total, excl. tax free beer (index 2000=100) (alc. 4,6 vol.) (2000- 2000 100
Consumption of lager equivalents, total, excl. tax free beer (index 2000=100) (alc. 4,6 vol.) (2000- 2001 97
Consumption of lager equivalents, total, excl. tax free beer (index 2000=100) (alc. 4,6 vol.) (2000- 2002 96
Consumption of lager equivalents, total, excl. tax free beer (index 2000=100) (alc. 4,6 vol.) (2000- 2003 98
Consumption of lager equivalents, total, excl. tax free beer (index 2000=100) (alc. 4,6 vol.) (2000- 2004 93
Consumption of lager equivalents, total, excl. tax free beer (index 2000=100) (alc. 4,6 vol.) (2000- 2005 93

Analysis 1: Duirnal patterns of traffic accidents

An interesting analysis would be to see at which time of day most accidents occur:

accidents_by_hour <- accidents %>%
    filter(KLOK != "Not stated") %>%
    group_by(UHELDA, KLOK) %>%
    summarise(mean_accidents = mean(INDHOLD, na.rm = TRUE))
#> `summarise()` has grouped output by 'UHELDA'. You can override using the
#> `.groups` argument.

accidents_by_hour %>%
    ggplot(aes(x = KLOK, y = mean_accidents, color = UHELDA, group = UHELDA)) +
    geom_line() +
    geom_point() +
    theme_bw() + 
  theme(legend.position="top") +
  labs(x = "Time of day", y = "Average annual accidents")

A pattern is quite clear: non-alcohol related accidents peak during commute hours in the morning and the afternoon, while alcohol related accidents are highest and almost equally likely between 6pm and 3am. It is important to note here that the values on the y-axis are total annual accidents for the average year between 1997 and 2018. So a value of e.g. 50, means that of all accidents in an average year 50 occurred in a given hour (summed over all days in the year) - so these are not average hourly accidents.