Deployment of star databases with incremental refresh

Introduction

Once we have developed a star database in R, we would like to exploit it directly in R to define multidimensional queries, but that is part of a future project. For now, we may be interested in deploying it in an RDBMS (Relational Database Management System) to use it with OLAP (On-Line Analytical Processing) query tools.

The deployment process consists of exporting the tables that make up the star databases to the RDBMS and also keeping them updated.

The vignette titled Incremental refresh of star databases, vignette("v40-refresh"), describes how star databases can be periodically updated based on the new data obtained. These updates are transmitted to the RDBMS automatically using the procedure described in this document.

This document shows, by means of an example, the possibilities offered by the package in this context. First, the star database and the deployment processes are presented. The next section shows how the incremental refresh is applied and the result obtained in the relational databases where the deployments have been carried out. Finally, the document finishes with the conclusions.

Star databases and deployments

This section shows the starting star databases and their deployment process.

Star databases

The starting star database is the content of the variable mrs_db, obtained in the vignette titled Obtaining and transforming flat tables, vignette("v05-flat-table-op"). It contains a constellation formed by two star databases. Next, we get their names.

library(rolap)

mrs_db |> 
  get_fact_names()
#> [1] "mrs_age"   "mrs_cause"

The code to generate the constellation from the initial data is available in the vignette. Below is a graphic representation of the tables that make it up.

db_dm <- mrs_db |>
  as_dm_class(pk_facts = FALSE)
db_dm |> 
  dm::dm_draw(rankdir = "LR", view_type = "all")

Deployment

To deploy, we need a file to store the star database. In a real case, it would be a file under our control in our folder tree: we only need a file name. For the vignette, let’s get a temporary one.

mrs_rdb_file <- tempfile()

We are going to carry out the deployment on SQLite. We will also need a file to store the database. As before, if we use this RDBMS, we would use a file under our control. For the vignette we will use a temporary one.

mrs_sqlite_file <- tempfile("mrs", fileext = ".sqlite")

The deployment functions must be able to access the database that we use. To give them access to the database, we provide a connection function (it must return an object of class DBIConnection) and a disconnection function (it must accept a parameter of class DBIConnection).

mrs_sqlite_connect <- function() {
  DBI::dbConnect(RSQLite::SQLite(),
                 dbname = mrs_sqlite_file)
}

mrs_sqlite_disconnect <- function(con) {
  DBI::dbDisconnect(con)
}

If the disconnection function is like the previous one, we can omit it because it is the default.

With these parameters, we deploy with the deploy function, as shown below.

mrs_db <- mrs_db |>
  deploy(
    name = "mrs_sqlite",
    connect = mrs_sqlite_connect,
    disconnect = mrs_sqlite_disconnect,
    file = mrs_rdb_file
  )

As a result, fact and dimension tables are exported to the database. The star_database object is stored (in the indicated file) so that it corresponds to the version stored in the relational database. Additionally, from this moment on, the star database and the relational database are linked: refresh operations on the star database will automatically be propagated to the relational database.
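As a quick check (a minimal sketch using base R and the temporary files defined above), we can verify that both the star database copy and the SQLite database files have been created:

# both files should exist after the deployment
file.exists(mrs_rdb_file)
file.exists(mrs_sqlite_file)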

Contents of the constellation and the relational database

Next, we are going to show the content of the star databases and the associated relational database.

For facts and dimensions, the number of instances in the star databases is shown below.

l_db <- mrs_db |>
  as_tibble_list()

names <- sort(names(l_db))
for (name in names){
  cat(sprintf("name: %s, %d rows\n", name, nrow(l_db[[name]])))
}
#> name: mrs_age, 18228 rows
#> name: mrs_cause, 3677 rows
#> name: when, 2076 rows
#> name: where, 122 rows
#> name: who, 5 rows

We get the same information for the relational database.

mrs_con <- mrs_sqlite_connect()

tables <- DBI::dbListTables(mrs_con)
for (t in tables) {
  res <- DBI::dbGetQuery(mrs_con, sprintf('SELECT COUNT(*) FROM `%s`', t))
  cat(sprintf("name: %s, %d rows\n", t, res[[1]]))
}
#> name: mrs_age, 18228 rows
#> name: mrs_cause, 3677 rows
#> name: when, 2076 rows
#> name: where, 122 rows
#> name: who, 5 rows

mrs_sqlite_disconnect(mrs_con)

We can see that the tables and their number of instances are the same in both cases.
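This comparison can also be scripted. The following sketch reuses the connection function and the l_db list defined above and checks, table by table, that the row counts match in both sources:

mrs_con <- mrs_sqlite_connect()
for (name in sort(names(l_db))) {
  # count the rows of the table in the relational database
  res <- DBI::dbGetQuery(mrs_con, sprintf('SELECT COUNT(*) FROM `%s`', name))
  # stop if it does not match the number of rows in the star database
  stopifnot(nrow(l_db[[name]]) == res[[1]])
}
mrs_sqlite_disconnect(mrs_con)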

Another deployment

We can perform more than one deployment associated with a star database. Next, we get a second temporary file for the new SQLite relational database.

mrs_sqlite_file_2 <- tempfile("mrs", fileext = ".sqlite")

We need a new connection function to use the new file. We define it below.

mrs_sqlite_connect_2 <- function() {
  DBI::dbConnect(RSQLite::SQLite(),
                 dbname = mrs_sqlite_file_2)
}

In this case we are going to use the default disconnection function and, since a previous deployment has already been carried out, we do not need to indicate a file to store the star database: it is already being stored in one. The call to the deployment function is shown below.

mrs_db <- mrs_db |>
  deploy(
    name = "mrs_sqlite_2",
    connect = mrs_sqlite_connect_2
  )

We can consult the current deployments using the following function.

mrs_db |>
  get_deployment_names()
#> [1] "mrs_sqlite"   "mrs_sqlite_2"

If necessary, we can also cancel a deployment using the cancel_deployment() function: the relational database is not affected, but it will no longer receive subsequent updates.
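For example, cancelling the second deployment might look like the sketch below (we assume here that cancel_deployment() accepts the deployment name through a name parameter; check the function documentation for its exact signature). We do not run it in this vignette because we want both deployments to be refreshed in the next section.

# not run: cancel the second deployment by name (assumed parameter name)
mrs_db <- mrs_db |>
  cancel_deployment(name = "mrs_sqlite_2")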

Incremental refresh

This section shows how incremental refresh updates are transferred from the star databases to the relational databases included in the deployments carried out.

To perform the incremental update, we take the data and the process presented in the vignette titled Incremental refresh of star databases, vignette("v40-refresh").

New data

The update data is stored, in the form of a flat table, in the package's mrs_ft_new variable. Below are the first records of the table. We access the table using the get_table() function on the flat_table object.

ft <- mrs_ft_new |> 
  get_table()

ft
#> # A tibble: 375 × 13
#>    Year  WEEK  `Week Ending Date` REGION State City       Pneumonia and Influe…¹
#>    <chr> <chr> <chr>              <chr>  <chr> <chr>      <chr>                 
#>  1 1962  48    12/01/1962         2      NY    Buffalo    5                     
#>  2 1963  3     01/19/1963         4      IA    Des Moines 2                     
#>  3 1963  6     02/09/1963         8      CO    Pueblo     0                     
#>  4 1963  7     02/16/1963         7      TX    El Paso    0                     
#>  5 1963  25    06/22/1963         1      MA    Springfie… 5                     
#>  6 1964  10    03/07/1964         1      MA    Cambridge  3                     
#>  7 1964  12    03/21/1964         8      CO    Colorado … 1                     
#>  8 1964  20    05/16/1964         3      IL    Chicago    34                    
#>  9 1964  20    05/16/1964         8      CO    Pueblo     0                     
#> 10 1964  27    07/04/1964         3      OH    Akron      0                     
#> # ℹ 365 more rows
#> # ℹ abbreviated name: ¹​`Pneumonia and Influenza Deaths`
#> # ℹ 6 more variables: `All Deaths` <chr>, `<1 year (all cause deaths)` <chr>,
#> #   `1-24 years (all cause deaths)` <chr>, `25-44 years` <chr>,
#> #   `45-64 years (all cause deaths)` <chr>,
#> #   `65+ years (all cause deaths)` <chr>

Incremental refresh process

We apply the modification process carried out on the star databases to the data in the flat table. This process is stored in the star_database object itself, so we do not have to look up the functions that were applied.

mrs_db_age_refresh <- mrs_ft_new |>
  update_according_to(mrs_db, star = "mrs_age")

mrs_db_cause_refresh <- mrs_ft_new |>
  update_according_to(mrs_db, star = "mrs_cause")

The result is that we replicate the structure of each of the components of the constellation on the new data. With this structure we can now carry out the refresh process itself for each of the star databases.

mrs_db <- mrs_db |>
  incremental_refresh(mrs_db_age_refresh) |>
  incremental_refresh(mrs_db_cause_refresh, existing_instances = "group")

In each case, using the existing_instances parameter, we can decide what to do with the instances that appear in the update and that were already included in the star database.

Incremental refresh result

To check the result obtained, we are going to obtain again the names of the tables and the number of instances, both for the star databases and for each of the deployments.

Star database

l_db <- mrs_db |>
  as_tibble_list()

names <- sort(names(l_db))
for (name in names){
  cat(sprintf("name: %s, %d rows\n", name, nrow(l_db[[name]])))
}
#> name: mrs_age, 18228 rows
#> name: mrs_cause, 3677 rows
#> name: when, 2076 rows
#> name: where, 122 rows
#> name: who, 5 rows

First deployment

mrs_con <- mrs_sqlite_connect()

tables <- DBI::dbListTables(mrs_con)
for (t in tables) {
  res <- DBI::dbGetQuery(mrs_con, sprintf('SELECT COUNT(*) FROM `%s`', t))
  cat(sprintf("name: %s, %d rows\n", t, res[[1]]))
}
#> name: mrs_age, 18228 rows
#> name: mrs_cause, 3677 rows
#> name: when, 2076 rows
#> name: where, 122 rows
#> name: who, 5 rows

mrs_sqlite_disconnect(mrs_con)

Second deployment

mrs_con_2 <- mrs_sqlite_connect_2()

tables <- DBI::dbListTables(mrs_con_2)
for (t in tables) {
  res <- DBI::dbGetQuery(mrs_con_2, sprintf('SELECT COUNT(*) FROM `%s`', t))
  cat(sprintf("name: %s, %d rows\n", t, res[[1]]))
}
#> name: mrs_age, 18228 rows
#> name: mrs_cause, 3677 rows
#> name: when, 2076 rows
#> name: where, 122 rows
#> name: who, 5 rows

mrs_sqlite_disconnect(mrs_con_2)

We can observe that in all three cases the number of instances is the same and has changed with respect to the situation prior to carrying out the refresh operations.

Checking star database copy synchronization

In addition to updating the deployments in the relational databases, the copy of the star database is automatically kept up to date in its file. We can load it into a variable using the load_star_database() function and check that its content is identical to that of the original star database.

mrs_db_new <- load_star_database(mrs_rdb_file)

identical(mrs_db, mrs_db_new)
#> [1] TRUE

If the file has the .rds extension, the readRDS() function could also be used directly.
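A minimal sketch, assuming the copy had been saved to a file with the .rds extension ("mrs_db.rds" is a hypothetical path used only for illustration):

# "mrs_db.rds" is a hypothetical file name with the .rds extension
mrs_db_new <- readRDS("mrs_db.rds")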

The star_database object contained in the new variable is fully operational.
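For instance, we can query it just like the original object, which returns the same fact names as for mrs_db:

mrs_db_new |>
  get_fact_names()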

Conclusions

This document shows the functions offered by the rolap package to support deployment in relational databases and the automatic incremental refresh of those databases.

If we need to perform OLAP analysis using tools that obtain their data from relational databases, this functionality allows R developers to perform the transformations in R in an even more productive way, instead of having to work with other tools.