Definition process

The definition process consists of the following steps:

Creating an object.
Definition of general aspects.
Definition of instances.
Selection of levels and attributes.
Generation of the result.
Getting the dimension table.

Steps 2, 3 and 4 can be done in any order, they can even be included in step 1.

Below we discuss each of them.

Creating an object

The definition process begins by creating an object of the when class.

library(when)

w_date <- when()

By default, an object is created based on date, but we can also create it based on time by indicating it using the type parameter.

w_time <- when(type = 'time')

Virtually all configuration options can be defined using the object creation function. They can also be defined later using specific functions, as shown in the following section.

Definition of general aspects

For example, using the following combination of functions we get the same result as defining the type when we create the object.

w_time_2 <- when() |>
  define_characteristics(type = 'time')

identical(w_time, w_time_2)
#> [1] TRUE

By default it considers the system language. Since our system is Spanish, for the literals of day and month names to appear in English, we must configure it explicitly.

w_date <- w_date |>
  define_characteristics(locale = Sys.setlocale("LC_TIME", "English"))

In section Other configuration options we present the rest of the configuration options for dimension generation. For the examples that we are going to use, the default options of the package are enough.

Next we have to select and configure the levels and define the instances. These operations, along with the current one, can be performed in any order.

Definition of instances

We can define the instances by giving a range of values or by indicating the specific instances. The values in both cases will depend on the type of dimension (date-based or time-based).

We can indicate these values at the time of creating the class object or through the define_instances() function.

In the case of date, we can indicate a year, year and month (with the format yyyy-mm), or year month and day (yyyy-mm-dd).
In the case of time, we can indicate an hour, hour and minute (with the format hh:mm), or hour, minute and second (hh:mm:ss).

In both cases we can use the string format, the one provided by the lubridate package or, when it is a single number (year or hour), we can use an integer value.

For example, below they are used with the format of lubridate.

w_date <- w_date |>
  define_instances(start = lubridate::today(),
                   end = lubridate::today() + lubridate::years(5))

In the following definitions we configure the dimensions so that they consider only the indicated dates. The second will be appropriate if we want to obtain the dimension at the year level, because we are indicating the specific values to consider (not a period).

w_date_2_1 <-
  when(
    values = c(
      "2023-12-31",
      "2023-01-01",
      "2022-12-31",
      "2022-01-01",
      "2021-12-31",
      "2021-01-01"
    )
  )

w_date_2_2 <- w_date |>
  define_instances(values = 2020:2030)

The following definition will be appropriate at any level of detail because we are giving the range of values to consider (dates between those two years).

w_date_3 <- w_date |>
  define_instances(start = 2020, end = 2030)

When only the year is indicated, it is considered its first day. Therefore the previous definition is equivalent to the following one.

w_date_4 <- w_date |>
  define_instances(start = "2020-01-01", end = "2030-01-01")

identical(w_date_3, w_date_4)
#> [1] TRUE

In the case of time, by default, it is considered at second level and all seconds of the day.

w_time_3 <- w_time |>
  define_instances(start = "00:00:00", end = "23:59:59")

identical(w_time, w_time_3)
#> [1] TRUE

We can also indicate a period or a set of times according to the indicated criteria. In the following example, the two definitions are equivalent.

w_time_4 <- w_time |>
  define_instances(start = 8, end = 17)

w_time_5 <- w_time |>
  define_instances(start = "08:00:00", end = "17:00:00")

identical(w_time_4, w_time_5)
#> [1] TRUE

Selection of levels and attributes

In the definition, by default, the attributes that we have considered to be most frequently used have been included. We can consult them for each level using the get_level_attribute_names() function with the parameter selected = TRUE. If we define the parameter as selected = FALSE (default option), it will return all available attributes. We can indicate the name of the level or, if we do not indicate any, it will obtain the attributes of all the levels. We obtain the level names using the get_level_names() function. Below are examples of using these functions for the date-based dimension.

w_date |>
  get_level_attribute_names(selected = TRUE)
#>  [1] "date"           "month_day"      "week_day"       "day_name"      
#>  [5] "day_num_name"   "year_week"      "week"           "year_month"    
#>  [9] "month"          "month_name"     "month_num_name" "year"

w_date |>
  get_level_names()
#> [1] "day"      "week"     "month"    "quarter"  "semester" "year"

w_date |>
  get_level_attribute_names(name = 'month', selected = TRUE)
#> [1] "year_month"     "month"          "month_name"     "month_num_name"

w_date |>
  get_level_attribute_names(name = 'month')
#> [1] "year_month"     "month"          "month_name"     "month_num_name"
#> [5] "month_abbr"     "month_num_abbr"

Also for the time-based dimension.

w_time |>
  get_level_attribute_names()
#> [1] "time"     "hour"     "minute"   "second"   "day_part"

w_time |>
  get_level_names()
#> [1] "time"

There is a function to configure each level. We can select or deselect each attribute individually using the attribute name as a boolean parameter of the function. We can also select or deselect all attributes defined at the level using the special include_all and exclude_all parameters: Once all the attributes have been updated with these special parameters, in the same function call, we can select or deselect the attributes we want.

For example, if we do not want to include the name of the month, we can configure it using the select_month_level() function or at the time of object creation with a parameter of the same name, as shown below.

w_date_5 <- w_date |>
  select_month_level(month_name = FALSE)

w_date_6 <- when(
  start = lubridate::today(),
  end = lubridate::today() + lubridate::years(5),
  month_name = FALSE
)

identical(w_date_5, w_date_6)
#> [1] TRUE

w_date_5 |>
  get_level_attribute_names(name = 'month', selected = TRUE)
#> [1] "year_month"     "month"          "month_num_name"

If we only want to include the name of the month, we can do it as follows.

w_date_7 <- w_date |>
  select_month_level(exclude_all = TRUE, month_name = TRUE)

w_date_7 |>
  get_level_attribute_names(name = 'month', selected = TRUE)
#> [1] "month_name"

There is available a similar function for each level: time, date, week, month, quarter, semester and year.

In the case of the date-based dimension levels, we can select or deselect them directly using the select_date_levels() function. It allows us to treat the levels globally in the same way that we have been treating the attributes. For example, to exclude the month level we can do the following.

w_date_8 <- w_date |>
  select_date_levels(month_level = FALSE)

w_date_8 |>
  get_level_attribute_names(name = 'month', selected = TRUE)
#> character(0)

We can also perform that same operation at object creation time, as shown below.

w_date_9 <- when(
  start = lubridate::today(),
  end = lubridate::today() + lubridate::years(5),
  month_level = FALSE
)

In the case of a time-based dimension, as we have seen, we only consider one level called “time”, as shown again below.

w_time |>
  get_level_names()
#> [1] "time"

We can configure the attributes it includes using the select_time_level() function. The attribute called “hour” must always be included. If we do not include the “minute” attribute then “second” will not be included either. Below are some examples of definition and the result obtained.

w_time_6 <- w_time |>
  select_time_level(exclude_all = TRUE)

w_time_6 |>
  get_level_attribute_names(selected = TRUE)
#> [1] "hour"

w_time_7 <- w_time |>
  select_time_level(minute = FALSE)

w_time_7 |>
  get_level_attribute_names(selected = TRUE)
#> [1] "time"     "hour"     "day_part"

Generation of the result

Once the characteristics of the dimension are defined, we can generate its table.

Although we have not yet generated the dimension table, the following function returns the names of the attributes it will contain. They can be obtained as a string or as a vector of strings.

w_date |>
  get_table_attribute_names(as_string = FALSE)
#>  [1] "id"             "date"           "month_day"      "week_day"      
#>  [5] "day_name"       "day_num_name"   "year_week"      "week"          
#>  [9] "year_month"     "month"          "month_name"     "month_num_name"
#> [13] "year"

If it is generated in the form of a string (as_string = TRUE, default option), the result can be used to easily change the name of the attributes, when the table is generated, using the set_table_attribute_names() function, if deemed necessary, as shown in section Renaming the attributes.

When we consider that the dimension is already properly defined, we can generate its corresponding table using the generate_table() function. We update the definition of the object. Any errors found will be notified to us.

w_date <- w_date |>
  generate_table()

w_time <- w_time |>
  generate_table()

Once the table is generated, we can export it. If we want to access the table and it is not yet generated, it will send us an error message.

If we make changes to the configuration, it is our responsibility to generate the table again.

Getting the dimension table

We can access the dimension table to work with it in R (for example using the rolap package), or export it in the format that is most convenient for us to work with other tools.

Next we access the generated tables and display them. First of all for the date.

t_date <- w_date |>
  get_table()

The first and last rows of the obtained result are shown below.

pander::pandoc.table(rbind(head(t_date, 5), tail(t_date, 5)),
                     split.table = Inf)

id	date	month_day	week_day	day_name	day_num_name	year_week	week	year_month	month	month_name	month_num_name	year
1	2024-01-08	08	1	lunes	1-lunes	2024-02	02	2024-01	01	enero	01-enero	2024
2	2024-01-09	09	2	martes	2-martes	2024-02	02	2024-01	01	enero	01-enero	2024
3	2024-01-10	10	3	miércoles	3-miércoles	2024-02	02	2024-01	01	enero	01-enero	2024
4	2024-01-11	11	4	jueves	4-jueves	2024-02	02	2024-01	01	enero	01-enero	2024
5	2024-01-12	12	5	viernes	5-viernes	2024-02	02	2024-01	01	enero	01-enero	2024
1824	2029-01-04	04	4	jueves	4-jueves	2029-01	01	2029-01	01	enero	01-enero	2029
1825	2029-01-05	05	5	viernes	5-viernes	2029-01	01	2029-01	01	enero	01-enero	2029
1826	2029-01-06	06	6	sábado	6-sábado	2029-01	01	2029-01	01	enero	01-enero	2029
1827	2029-01-07	07	7	domingo	7-domingo	2029-01	01	2029-01	01	enero	01-enero	2029
1828	2029-01-08	08	1	lunes	1-lunes	2029-02	02	2029-01	01	enero	01-enero	2029

Let’s generate the dimension at the week level including all its attributes (if we change the definition, we have to generate the table again).

t_date <- w_date |>
  select_date_levels(day_level = FALSE) |>
  select_week_level(include_all = TRUE) |>
  generate_table() |>
  get_table()

The first and last rows of the obtained result are shown below.

pander::pandoc.table(rbind(head(t_date, 5), tail(t_date, 5)),
                     split.table = Inf)

id	year_week	week	year_month	month	month_name	month_num_name	year
1	2024-02	02	2024-01	01	enero	01-enero	2024
2	2024-03	03	2024-01	01	enero	01-enero	2024
3	2024-04	04	2024-01	01	enero	01-enero	2024
4	2024-05	05	2024-01	01	enero	01-enero	2024
5	2024-05	05	2024-02	02	febrero	02-febrero	2024
310	2028-51	51	2028-12	12	diciembre	12-diciembre	2028
311	2028-52	52	2028-12	12	diciembre	12-diciembre	2028
312	2028-53	53	2028-12	12	diciembre	12-diciembre	2028
313	2029-01	01	2029-01	01	enero	01-enero	2029
314	2029-02	02	2029-01	01	enero	01-enero	2029

Next for the time.

t_time <- w_time |>
  get_table()

The start and end of the result table is shown below.

pander::pandoc.table(rbind(head(t_time, 5), tail(t_time, 5)),
                     split.table = Inf)

id	time	hour	minute	second	day_part
1	00:00:00	00	00	00	Night
2	00:00:01	00	00	01	Night
3	00:00:02	00	00	02	Night
4	00:00:03	00	00	03	Night
5	00:00:04	00	00	04	Night
86396	23:59:55	23	59	55	Night
86397	23:59:56	23	59	56	Night
86398	23:59:57	23	59	57	Night
86399	23:59:58	23	59	58	Night
86400	23:59:59	23	59	59	Night

day_part is predefined in English. Literals or associated hours can be accessed and changed using the get_day_part() and set_day_part() functions, as shown in section Names of the parts of the day.

Now we are going to generate it at the minute level.

t_time <- w_time |>
  select_time_level(second = FALSE) |>
  generate_table() |>
  get_table()

The start and end from the new result table is shown below.

pander::pandoc.table(rbind(head(t_time, 5), tail(t_time, 5)),
                     split.table = Inf)

id	time	hour	minute	day_part
1	00:00:00	00	00	Night
2	00:01:00	00	01	Night
3	00:02:00	00	02	Night
4	00:03:00	00	03	Night
5	00:04:00	00	04	Night
1436	23:55:00	23	55	Night
1437	23:56:00	23	56	Night
1438	23:57:00	23	57	Night
1439	23:58:00	23	58	Night
1440	23:59:00	23	59	Night

We can take the generated tables to any Relational DBMS, as shown below, we just have to pass a connection to the database as a parameter.

my_db <- DBI::dbConnect(RSQLite::SQLite())

w_date |>
  get_table_rdb(my_db)

w_time |>
  get_table_rdb(my_db)

DBI::dbListTables(my_db)
#> [1] "date" "time"

DBI::dbDisconnect(my_db)

In the previous example, since we have not explicitly defined any names for the tables in the configuration, they have been assigned as a name the element on which they are based (date or time). We can define these names, as shown in section Table name.

Functions to get the tables in xlsx and csv format are also available: get_table_xlsx() and get_table_csv().

Other configuration options

In section Definition of general aspects we have presented the necessary configuration options for the examples we have used. In particular, we have only needed to define the type on which the dimension is based (time or date) and the language for naming days and months (because it is different from the language of our operating system). The default options have been adequate for the examples shown.

In this section we are going to show other configuration options that may be necessary to define our dimensions.

Dimension table features

The result of a definition operation performed with the where package is a table of a dimension at a given level of granularity, with the attributes selected for that level or associated coarser levels. So far we have focused on level attributes and table instances. In this section we are going to configure other characteristics of it.

Table name

By default, the table is assigned the name of the base component used to generate the dimension: date or time. We have seen it in the tables that we have stored in the database, in the example of the section Getting the dimension table.

We can define the name of the dimension at the time of object creation or through the define_characteristics() function, as shown below.

wd_1 <- when(name = 'dim_where')

wd_2 <- when() |>
  define_characteristics(name = 'dim_where')

This name is the one that will be assigned to the database table where we store it or to the files we obtain. Below is the case for a relational database.

my_db <- DBI::dbConnect(RSQLite::SQLite())

wd_1 |>
  generate_table() |>
  get_table_rdb(my_db)

DBI::dbListTables(my_db)
#> [1] "dim_where"

DBI::dbDisconnect(my_db)

Also the case of obtaining it in csv format (if we do not indicate a location, it stores it in one by default).

wd_2 |>
  generate_table() |>
  get_table_csv()
#> [1] "C:\\Users\\joses\\AppData\\Local\\Temp\\Rtmp6XNQKs/dim_where.csv"

Surrogate key

By default the dimension table will include a surrogate key. In some situations we may be interested in the table having the date itself as its primary key (for example if we want to partition the fact table by the date).

Through the surrogate_key parameter, included in the define_characteristics() function and also the when() function, we can indicate that a surrogate key is not created, as it’s shown in the following.

when() |>
  get_table_attribute_names(as_string = FALSE)
#>  [1] "id"             "date"           "month_day"      "week_day"      
#>  [5] "day_name"       "day_num_name"   "year_week"      "week"          
#>  [9] "year_month"     "month"          "month_name"     "month_num_name"
#> [13] "year"

when(surrogate_key = FALSE) |>
  get_table_attribute_names(as_string = FALSE)
#>  [1] "date"           "month_day"      "week_day"       "day_name"      
#>  [5] "day_num_name"   "year_week"      "week"           "year_month"    
#>  [9] "month"          "month_name"     "month_num_name" "year"

The name of the surrogate key is id, is the first attribute of the table. If necessary, we can rename the attributes, including the surrogate key as shown in the next section.

Renaming the attributes

If we get a table in tibble format, we can rename its attributes using the functions of this class. But if we want to obtain it in a file or in a database, we may be interested in renaming its attributes before obtaining it.

In order to rename the attributes, we first have to generate the table.

wd_3 <- when() |>
  generate_table()

We can rename them using the set_table_attribute_names() function, passing a name for each of the available attributes. To facilitate the operation, using the get_table_attribute_names() function, we can obtain the attributes in string format. With this we only have to replace the names that we want to change, as shown below.

wd_3 |>
  get_table_attribute_names()
#> [1] "c('id', 'date', 'month_day', 'week_day', 'day_name', 'day_num_name', 'year_week', 'week', 'year_month', 'month', 'month_name', 'month_num_name', 'year')"

wd_3 <- wd_3 |>
  set_table_attribute_names(
    c(
      'id_when',
      'date',
      'month_day',
      'week_day',
      'day_name',
      'day_num_name',
      'year_week',
      'week',
      'year_month',
      'month',
      'month_name',
      'month_num_name',
      'year'
    )
  )

wd_3 |>
  get_table_attribute_names(as_string = FALSE)
#>  [1] "id_when"        "date"           "month_day"      "week_day"      
#>  [5] "day_name"       "day_num_name"   "year_week"      "week"          
#>  [9] "year_month"     "month"          "month_name"     "month_num_name"
#> [13] "year"

Names of the parts of the day

We have associated each hour with the usual part of the day in English. We can consult them and change them if necessary (for example, to define them in another language) using the get_day_part() and set_day_part() functions.

Below is an example where the definition is modified to consider only two parts in the day.

when() |>
  get_day_part()
#>          00          01          02          03          04          05 
#>     "Night"     "Night"     "Night"     "Night"     "Night"   "Morning" 
#>          06          07          08          09          10          11 
#>   "Morning"   "Morning"   "Morning"   "Morning"   "Morning"   "Morning" 
#>          12          13          14          15          16          17 
#> "Afternoon" "Afternoon" "Afternoon" "Afternoon" "Afternoon"   "Evening" 
#>          18          19          20          21          22          23 
#>   "Evening"   "Evening"   "Evening"     "Night"     "Night"     "Night"

when() |>
  set_day_part(hour = c(20:23, 0:5), name = "Night") |>
  set_day_part(hour = c(6:19), name = "Day") |>
  get_day_part()
#>      00      01      02      03      04      05      06      07      08      09 
#> "Night" "Night" "Night" "Night" "Night" "Night"   "Day"   "Day"   "Day"   "Day" 
#>      10      11      12      13      14      15      16      17      18      19 
#>   "Day"   "Day"   "Day"   "Day"   "Day"   "Day"   "Day"   "Day"   "Day"   "Day" 
#>      20      21      22      23 
#> "Night" "Night" "Night" "Night"

Configuration of day and week levels

This section shows the configuration options available for the day and week levels.

The first day of the week

For days, in addition to the language of the name of the day of the week, we can define the start day of the week. There are two possibilities: Monday or Sunday. This determines the numbering of the days.

We can define it using the parameter week_starts_monday (which by default has the value TRUE) in the functions define_characteristics() or when(), as shown below.

wd_1 <- when(week_starts_monday = FALSE)

wd_2 <- when() |>
  define_characteristics(week_starts_monday = FALSE)

Numbering of weeks in the year

Using the value of the type parameter (type = "date" or type = "time") in the define_characteristics() or when() functions we can indicate whether the dimension is based on date or time.

This parameter can additionally take two more values: “iso” and “epi”. Using these values (together with the “date” value) we can indicate the type of numbering of the weeks in the year. That is, the week number associated with each date depends on the type of date dimension selected:

Default (type = "date"): It numbers blocks of 7 days beginning on January 1 (regardless of what day of the week it is). The last week of the year will last less than 7 days.
ISO 8601 (type = "iso"): It numbers blocks of 7 days from Monday to Sunday. The first and last week of the year can contain days from the previous or next year.
Epidemiological (type = "epi"): This week is like ISO 8601 only that it considers that the week begins on Sunday.

For the first and last days of the year, the year associated with the week may be different from the year of the date, depending on the type selected (“iso” or “epi”).

The “year” attribute (alone or in combination with other attributes) displays the year associated with the date, regardless of the type selected. Only the “year_week” attribute displays the year of the week according to the selected type.

The next section shows how to obtain the attribute obtaining function and how to change it if we consider it necessary (for example, to get the different year types together).

Get date range of weeks

If we want to work with the dimension at the week level, we may be interested in having the start or end date for each week. To obtain them and be able to use them to generate the dimension, we have the get_week_date_range() function.

t <- get_week_date_range(start = "2024-01-01", end = "2029-12-31")

A period and the type of week numbering (“date”, “iso” or “epi”) are indicated (by default it is “date”). Returns the result in tibble form.

The start and end from the new result is shown below.

pander::pandoc.table(rbind(head(t, 5), tail(t, 5)),
                     split.table = Inf)

year_week	first	last
2024-01	2024-01-01	2024-01-07
2024-02	2024-01-08	2024-01-14
2024-03	2024-01-15	2024-01-21
2024-04	2024-01-22	2024-01-28
2024-05	2024-01-29	2024-02-04
2029-49	2029-12-03	2029-12-09
2029-50	2029-12-10	2029-12-16
2029-51	2029-12-17	2029-12-23
2029-52	2029-12-24	2029-12-30
2029-53	2029-12-31	2029-12-31

We can use the column of the dates that interest us to generate the dimension. Additionally, we leave only the date and the levels that interest us (week and year).

tw <- when(values = t[["last"]], month_level = FALSE) |>
  select_day_level(exclude_all = TRUE, date = TRUE) |>
  generate_table() |>
  get_table()

The start and end from the new result table is shown below.

pander::pandoc.table(rbind(head(tw, 5), tail(tw, 5)),
                     split.table = Inf)

id	date	year_week	week	year
1	2024-01-07	2024-01	01	2024
2	2024-01-14	2024-02	02	2024
3	2024-01-21	2024-03	03	2024
4	2024-01-28	2024-04	04	2024
5	2024-02-04	2024-05	05	2024
314	2029-12-09	2029-49	49	2029
315	2029-12-16	2029-50	50	2029
316	2029-12-23	2029-51	51	2029
317	2029-12-30	2029-52	52	2029
318	2029-12-31	2029-53	53	2029

Change the definition function of an attribute

Each attribute is defined by a specific function. All definition functions have the same structure: they receive a table, a column with the date or time values, possibly additional configuration parameters, and return the table to which a new column with the attribute name and values has been added, calculated from the values.

We can get the definition function of any attribute using the get_attribute_definition_function() function. Below is the function for the “year” attribute and also the function of the year and week combination, “year_week” attribute (it uses one of the additional parameters available, the dimension type).

wd <- when()

wd |>
  get_attribute_definition_function(name = "year")
#> function (table, values, ...) 
#> {
#>     table[["year"]] <- as.character(lubridate::year(values))
#>     table
#> }
#> <bytecode: 0x000001cf0e5cdc48>
#> <environment: namespace:when>

wd |>
  get_attribute_definition_function(name = "year_week")
#> function (table, values, ...) 
#> {
#>     dots <- list(...)
#>     type <- dots[["type"]]
#>     switch(type, iso = {
#>         year <- lubridate::isoyear(values)
#>         week <- lubridate::isoweek(values)
#>     }, epi = {
#>         year <- lubridate::epiyear(values)
#>         week <- lubridate::epiweek(values)
#>     }, {
#>         year <- lubridate::year(values)
#>         week <- lubridate::week(values)
#>     })
#>     table[["year_week"]] <- paste0(year, "-", sprintf("%02d", 
#>         week))
#>     table
#> }
#> <bytecode: 0x000001cf0e628150>
#> <environment: namespace:when>

In view of the functions, we can define one with the same structure and establish it as a new attribute definition function using the set_attribute_definition_function() function. For example, below shows how to change the year attribute definition function to get an additional attribute if the year associated with the week can be different from the year of the date.

f <- function(table, values, ...) {
  dots <- list(...)
  type <- dots[['type']]
  table[['year']] <- as.character(lubridate::year(values))
  if (type == 'iso') {
    table[['week_year']] <- as.character(lubridate::isoyear(values))
  } else if (type == 'epi') {
    table[['week_year']] <- as.character(lubridate::epiyear(values))
  }
  table
}

wd <- wd |>
  set_attribute_definition_function(name = "year", f)

In this case, what we do is add two columns for the year: one with the year of the date and another with the year of the week if this can be different from the first. We can see the result by changing the type of the dimension and generating the table, as shown below (the two year attributes have been added).

t <- wd |>
  define_characteristics(type = 'iso') |>
  generate_table() |>
  get_table()

names(t)
#>  [1] "id"             "date"           "month_day"      "week_day"      
#>  [5] "day_name"       "day_num_name"   "year_week"      "week"          
#>  [9] "year_month"     "month"          "month_name"     "month_num_name"
#> [13] "year"           "week_year"

Generating the When Dimension based on date and time tables

2024-01-08

Introduction