diseasystore: quick start guide

library(diseasystore)

Available diseasystores

To see the available diseasystores on your system, you can use the available_diseasystores() function.

available_diseasystores()
#> [1] "DiseasystoreEcdcRespiratoryViruses" "DiseasystoreGoogleCovid19"

This function looks for diseasystores on the current search path. By default, this will show the diseasystores bundled with the base package. If you have extended diseasystore, either with your own diseasystores or with those from an external package, attaching that package to your search path makes them show up as available.

Note: diseasystores are found if they are defined within packages named diseasystore* and are of the class DiseasystoreBase.

Each of these diseasystores may have their own vignette that further details their content, use and/or tips and tricks. This is for example the case with DiseasystoreGoogleCovid19.

Using a diseasystore

To use a diseasystore we need to first do some configuration. The diseasystores are designed to work with data bases to store the computed features in. Each diseasystore may require individual configuration as listed in its documentation or accompanying vignette.

For this quick start, we will configure a DiseasystoreGoogleCovid19 to use a local SQLite data base. Ideally, we would use a faster, more capable data base to store the features in. diseasystore uses SCDB in the back end and can use any data base back end supported by SCDB.

ds <- DiseasystoreGoogleCovid19$new(
  target_conn = DBI::dbConnect(RSQLite::SQLite()),
  start_date = as.Date("2020-03-01"),
  end_date = as.Date("2020-03-15")
)

When we create our new diseasystore instance, we also supply start_date and end_date arguments. These are not strictly required, but make getting features for this time interval simpler.

Once configured, we can query the available features in the diseasystore:

ds$available_features
#>  [1] "n_population"    "age_group"       "country_id"      "country"        
#>  [5] "region_id"       "region"          "subregion_id"    "subregion"      
#>  [9] "n_hospital"      "n_deaths"        "n_positive"      "n_icu"          
#> [13] "n_ventilator"    "min_temperature" "max_temperature"

These features can be retrieved individually (using the start_date and end_date we specified during creation of ds):

ds$get_feature("n_hospital")
#> # Source:   table<`dbplyr_FrtBjrCm7u`> [?? x 5]
#> # Database: sqlite 3.46.0 []
#>   key_location key_age_bin n_hospital valid_from valid_until
#>   <chr>        <chr>            <dbl>      <dbl>       <dbl>
#> 1 AR           0                   NA      18322       18323
#> 2 AR           0                   NA      18330       18331
#> 3 AR           0                    0      18324       18325
#> 4 AR           0                    1      18323       18324
#> 5 AR           0                    1      18327       18328
#> # ℹ more rows

Notice that features have associated “key_*” and “valid_from/until” columns. These are used for one of the primary selling points of diseasystore, namely automatic aggregation.
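In the SQLite output above, valid_from and valid_until are printed as plain numbers. Assuming these are day counts since 1970-01-01 (R's usual numeric representation of dates, which is consistent with 18322 lining up with the start_date we requested), they can be converted back to dates with base R. This is only a sketch for reading the printed output; other back ends may return proper date columns.

```r
# Values copied from the valid_from / valid_until columns shown above
valid_from  <- c(18322, 18323, 18324)
valid_until <- c(18323, 18324, 18325)

# Assumption: the numbers are days since 1970-01-01
as.Date(valid_from, origin = "1970-01-01")
#> [1] "2020-03-01" "2020-03-02" "2020-03-03"
as.Date(valid_until, origin = "1970-01-01")
#> [1] "2020-03-02" "2020-03-03" "2020-03-04"
```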

To get features for other time intervals, we can manually supply start_date and/or end_date:

ds$get_feature("n_hospital",
               start_date = as.Date("2020-03-01"),
               end_date = as.Date("2020-03-02"))
#> # Source:   table<`dbplyr_511lAfaY2s`> [?? x 5]
#> # Database: sqlite 3.46.0 []
#>   key_location key_age_bin n_hospital valid_from valid_until
#>   <chr>        <chr>            <dbl>      <dbl>       <dbl>
#> 1 AR           0                   NA      18322       18323
#> 2 AR           0                    1      18323       18324
#> 3 AR           1                    0      18323       18324
#> 4 AR           1                    1      18322       18323
#> 5 AR           2                    1      18322       18323
#> # ℹ more rows

Dynamically expanded

The diseasystore automatically expands the computed features.

Say the feature “n_hospital” has been computed between 2020-03-01 and 2020-03-15. In this case, the call $get_feature("n_hospital", start_date = as.Date("2020-03-01"), end_date = as.Date("2020-03-20")) only needs to compute the feature between 2020-03-16 and 2020-03-20.
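The bookkeeping behind this can be illustrated with plain date arithmetic. The following is a sketch only, not diseasystore's implementation; the variable names are made up for the illustration.

```r
# Dates for which the feature is already stored
computed <- seq(as.Date("2020-03-01"), as.Date("2020-03-15"), by = "day")

# Dates covered by the new request
requested <- seq(as.Date("2020-03-01"), as.Date("2020-03-20"), by = "day")

# Only the dates not yet stored need to be computed
to_compute <- requested[!requested %in% computed]
range(to_compute)
#> [1] "2020-03-16" "2020-03-20"
```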

Time versioned

Through using {SCDB} as the back end, the features are stored even as new data becomes available. This way, we get a time-versioned record of the features provided by diseasystore.

The features being computed are controlled through the slice_ts argument. By default, diseasystores use today’s date for this argument.

The dynamic expansion of features described above applies within a given slice_ts only. That is, if a feature has been computed for a time interval on one slice_ts, diseasystore will recompute the feature for any other slice_ts.

This way, feature computation can be implemented into continuous integration (requesting features will preserve a history of computed features). Furthermore, post-hoc analysis can be performed by computing features as they would have looked on previous dates.

Automatic aggregation

The real strength of diseasystore comes from its built-in automatic aggregation.

We saw above that the features come with additional associated “key_*” and “valid_from/until” columns.

This additional information is used to do automatic aggregation through the $key_join_features() method (see extending-diseasystore for more details).

To use this method, you need to provide the observable that you want to aggregate and the stratification you want to apply to the aggregation.

Let's start with a simple example where we request no stratification (NULL):

ds$key_join_features(observable = "n_hospital",
                     stratification = NULL)
#> # A tibble: 15 × 2
#>   date       n_hospital
#>   <date>          <dbl>
#> 1 2020-03-01          3
#> 2 2020-03-02          6
#> 3 2020-03-03          5
#> 4 2020-03-04         12
#> 5 2020-03-05          8
#> # ℹ 10 more rows

This gives us the same feature information as ds$get_feature("n_hospital") but simplified to give the observable per day (in this case, the number of people hospitalised).

To specify a level of stratification, we need to supply a list of quosures (see help("topic-quosure", package = "rlang")).

ds$key_join_features(observable = "n_hospital",
                     stratification = rlang::quos(country_id))
#> # A tibble: 15 × 3
#>   date       country_id n_hospital
#>   <date>     <chr>           <dbl>
#> 1 2020-03-01 AR                  3
#> 2 2020-03-02 AR                  6
#> 3 2020-03-03 AR                  5
#> 4 2020-03-04 AR                 12
#> 5 2020-03-05 AR                  8
#> # ℹ 10 more rows

The stratification argument is very flexible, so we can supply any valid R expression:

ds$key_join_features(observable = "n_hospital",
                     stratification = rlang::quos(country_id,
                                                  old = age_group == "90+"))
#> # A tibble: 30 × 4
#>   date       country_id   old n_hospital
#>   <date>     <chr>      <int>      <dbl>
#> 1 2020-03-01 AR             0         27
#> 2 2020-03-02 AR             0         54
#> 3 2020-03-03 AR             0         45
#> 4 2020-03-04 AR             0        108
#> 5 2020-03-05 AR             0         72
#> # ℹ 25 more rows
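To get a feel for what such a list of quosures expresses, it can help to apply one to a small local data frame with dplyr. The sketch below is not how diseasystore performs the aggregation internally; the toy data and its values are invented for the illustration, and it assumes the dplyr and rlang packages are installed.

```r
# Invented toy data with the same column names as the features above
toy <- data.frame(
  country_id = c("AR", "AR", "AR", "DK"),
  age_group  = c("0-9", "90+", "90+", "90+"),
  n_hospital = c(3, 1, 2, 4)
)

# quos() captures the expressions without evaluating them;
# a named expression ("old") defines a new stratification column
stratification <- rlang::quos(country_id, old = age_group == "90+")

# Splice the quosures into group_by() and sum the observable per group
grouped <- dplyr::group_by(toy, !!!stratification)
result  <- dplyr::summarise(grouped, n_hospital = sum(n_hospital),
                            .groups = "drop")

# One row per combination of country_id and old present in the toy data
result
```

Here the stratification plays the role of a grouping specification: each distinct combination of the evaluated expressions becomes one stratum.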

Dropping computed features

Sometimes, it is necessary to clear the computed features from the data base. For this purpose, we provide the drop_diseasystore() function.

By default, this deletes all stored features in the default diseasystore schema. The function also accepts a pattern argument to match tables by and a schema argument to specify the schema to delete from (see footnote 1).

SCDB::get_tables(ds$target_conn)
#>    schema                              table
#> 1    main       ds.google_covid_19_age_group
#> 2    main           ds.google_covid_19_index
#> 3    main        ds.google_covid_19_hospital
#> 4    main                           ds.locks
#> 5    main                            ds.logs
#> 6    temp                ds_validities_17400
#> 7    temp                  dbplyr_SSpXjAGfwr
#> 8    temp                     SCDB_17400_040
#> 9    temp                     SCDB_17400_036
#> 10   temp                  dbplyr_GtRJTuBYT9
#> 11   temp                  dbplyr_SQFYQXK5BY
#> 12   temp ds_google_covid_19_age_group_17400
#> 13   temp                  dbplyr_RdVKfUH6qN
#> 14   temp          ds_all_combinations_17400
#> 15   temp                  dbplyr_sSMF2ujf3P
#> 16   temp                  dbplyr_bca9DxGjG8
#> 17   temp                     SCDB_17400_041
#> 18   temp                  dbplyr_phKSxFcywP
#> 19   temp                     SCDB_17400_032
#> 20   temp                     SCDB_17400_028
#> 21   temp                     SCDB_17400_025
#> 22   temp                  dbplyr_511lAfaY2s
#> 23   temp               ds_study_dates_17400
#> 24   temp                  dbplyr_FrtBjrCm7u
#> 25   temp                  dbplyr_GKzjmx4LWB
#> 26   temp                     SCDB_17400_033
#> 27   temp                  dbplyr_eD5tjxFsFI
#> 28   temp                     SCDB_17400_024
#> 29   temp                  dbplyr_Lqqx8MtXGl
#> 30   temp     ds_google_covid_19_index_17400
#> 31   temp                     SCDB_17400_017
#> 32   temp  ds_google_covid_19_hospital_17400
#> 33   temp                  dbplyr_UpL0EiARt2
#> 34   temp                     SCDB_17400_020
drop_diseasystore(conn = ds$target_conn)

SCDB::get_tables(ds$target_conn)
#>    schema                              table
#> 1    temp                ds_validities_17400
#> 2    temp                  dbplyr_SSpXjAGfwr
#> 3    temp                     SCDB_17400_040
#> 4    temp                     SCDB_17400_036
#> 5    temp                  dbplyr_GtRJTuBYT9
#> 6    temp                  dbplyr_SQFYQXK5BY
#> 7    temp ds_google_covid_19_age_group_17400
#> 8    temp                  dbplyr_RdVKfUH6qN
#> 9    temp          ds_all_combinations_17400
#> 10   temp                  dbplyr_sSMF2ujf3P
#> 11   temp                  dbplyr_bca9DxGjG8
#> 12   temp                     SCDB_17400_041
#> 13   temp                  dbplyr_phKSxFcywP
#> 14   temp                     SCDB_17400_032
#> 15   temp                     SCDB_17400_028
#> 16   temp                     SCDB_17400_025
#> 17   temp                  dbplyr_511lAfaY2s
#> 18   temp               ds_study_dates_17400
#> 19   temp                  dbplyr_FrtBjrCm7u
#> 20   temp                  dbplyr_GKzjmx4LWB
#> 21   temp                     SCDB_17400_033
#> 22   temp                  dbplyr_eD5tjxFsFI
#> 23   temp                     SCDB_17400_024
#> 24   temp                  dbplyr_Lqqx8MtXGl
#> 25   temp     ds_google_covid_19_index_17400
#> 26   temp                     SCDB_17400_017
#> 27   temp  ds_google_covid_19_hospital_17400
#> 28   temp                  dbplyr_UpL0EiARt2
#> 29   temp                     SCDB_17400_020
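The listings above are consistent with the SQLite behaviour described in footnote 1, where the schema is prepended to the pattern before matching. As a rough illustration in base R (not the code drop_diseasystore() actually runs), the example pattern from the footnote can be applied to a few of the table names shown above:

```r
# Table names copied from the SCDB::get_tables() output above
tables <- c(
  "ds.google_covid_19_age_group",
  "ds.google_covid_19_index",
  "ds.google_covid_19_hospital",
  "ds.locks",
  "ds.logs",
  "ds_validities_17400",
  "dbplyr_SSpXjAGfwr"
)

# Example pattern from footnote 1: the schema "ds" followed by a literal dot
pattern <- "ds\\..*"

matched <- tables[grepl(pattern, tables)]
matched
#> [1] "ds.google_covid_19_age_group" "ds.google_covid_19_index"    
#> [3] "ds.google_covid_19_hospital"  "ds.locks"                    
#> [5] "ds.logs"
```

The matched names are the five tables that disappear from the second listing, while the temporary tables (which lack the "ds." prefix) are left alone.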

diseasystore options

diseasystores have a number of options available to make configuration easier. These options all start with “diseasystore.”.

options()[purrr::keep(names(options()), ~ startsWith(., "diseasystore"))]
#> $diseasystore.DiseasystoreEcdcRespiratoryViruses.pull
#> [1] TRUE
#> 
#> $diseasystore.DiseasystoreEcdcRespiratoryViruses.remote_conn
#> [1] "https://api.github.com/repos/EU-ECDC/Respiratory_viruses_weekly_data"
#> 
#> $diseasystore.DiseasystoreEcdcRespiratoryViruses.source_conn
#> [1] "https://api.github.com/repos/EU-ECDC/Respiratory_viruses_weekly_data"
#> 
#> $diseasystore.DiseasystoreEcdcRespiratoryViruses.target_conn
#> [1] ""
#> 
#> $diseasystore.DiseasystoreEcdcRespiratoryViruses.target_schema
#> [1] ""
#> 
#> $diseasystore.DiseasystoreGoogleCovid19.n_max
#> [1] 1000
#> 
#> $diseasystore.DiseasystoreGoogleCovid19.remote_conn
#> [1] "https://storage.googleapis.com/covid19-open-data/v3/"
#> 
#> $diseasystore.DiseasystoreGoogleCovid19.source_conn
#> [1] "https://storage.googleapis.com/covid19-open-data/v3/"
#> 
#> $diseasystore.DiseasystoreGoogleCovid19.target_conn
#> [1] ""
#> 
#> $diseasystore.DiseasystoreGoogleCovid19.target_schema
#> [1] ""
#> 
#> $diseasystore.lock_wait_increment
#> [1] 15
#> 
#> $diseasystore.lock_wait_max
#> [1] 1800
#> 
#> $diseasystore.source_conn
#> [1] ""
#> 
#> $diseasystore.target_conn
#> [1] ""
#> 
#> $diseasystore.target_schema
#> [1] "ds"
#> 
#> $diseasystore.verbose
#> [1] FALSE

Notice that several options are set as empty strings (""). These are treated as NULL by diseasystore (see footnote 2).

Importantly, the options are scoped. Consider the above options for “source_conn”: Looking at the list of options we find “diseasystore.source_conn” and “diseasystore.DiseasystoreGoogleCovid19.source_conn”. The former is a general setting while the latter is a specific setting for DiseasystoreGoogleCovid19. The general setting is used as a fallback if no specific setting is found.

This allows you to set a general configuration to use and to overwrite it for specific cases.

To get the option related to a scope, we can use the diseasyoption() function.

diseasyoption("source_conn", class = "DiseasystoreGoogleCovid19")
#> [1] "https://storage.googleapis.com/covid19-open-data/v3/"

As we saw in the options, a source_conn option was defined specifically for DiseasystoreGoogleCovid19.

If we try the same for the hypothetical DiseasystoreDiseaseY, we see that no value is defined as we have not yet configured the fallback value.

diseasyoption("source_conn", class = "DiseasystoreDiseaseY")
#> NULL

If we change our general setting for source_conn and retry, we see that we get the fallback value.

options("diseasystore.source_conn" = file.path("local", "path"))
diseasyoption("source_conn", class = "DiseasystoreDiseaseY")
#> [1] "local/path"

Finally, we can use the .default argument as a final fallback value in case no option is set for either general or specific case.

diseasyoption("non_existent", class = "DiseasystoreDiseaseY", .default = "final fallback")
#> [1] "final fallback"
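The lookup order described in this section (specific setting, then general setting, then .default) can be mimicked in a few lines of base R. The helper lookup_option() and the "toypkg" option prefix below are hypothetical and exist only for this sketch; they are not part of diseasystore, and the real diseasyoption() may differ in its details. A separate prefix is used so that the example does not overwrite any real diseasystore options.

```r
# Hypothetical helper mimicking the scoped lookup described above
lookup_option <- function(option, class = NULL, .default = NULL,
                          prefix = "toypkg") {
  # Most specific name first, then the general one
  candidates <- c(
    if (!is.null(class)) paste(prefix, class, option, sep = "."),
    paste(prefix, option, sep = ".")
  )
  for (name in candidates) {
    value <- getOption(name)
    # Unset options and empty strings both count as "not configured"
    if (!is.null(value) && !identical(value, "")) return(value)
  }
  .default
}

options("toypkg.DiseaseX.source_conn" = "specific/path")
options("toypkg.DiseaseY.source_conn" = "")
options("toypkg.source_conn" = "general/path")

lookup_option("source_conn", class = "DiseaseX")
#> [1] "specific/path"
lookup_option("source_conn", class = "DiseaseY")
#> [1] "general/path"
lookup_option("non_existent", class = "DiseaseY", .default = "final fallback")
#> [1] "final fallback"
```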

  1. If using SQLite as the back end, it will instead prepend the schema specification to the pattern before matching (e.g. "ds\..*").

  2. R’s options() does not allow setting an option to NULL. By setting options as empty strings, the user can see the available options to set.