Stepping into the HYDAT Database

Dewey Dunnington

2024-01-09

The HYDAT database is a massive hydrologic data resource. The functions in this package are designed to get the most out of the HYDAT database as quickly as possible, however in the process of reformatting the data to be more useful, the package modifies the original tables. This vignette is intended to demonstrate how to access tables within the database for your own custom HYDAT analysis, should additional information be needed. This should not be necessary for the vast majority of users, and is intended only for advanced R users.

Downloading HYDAT

Before loading the HYDAT database, the latest version of the database must be downloaded using hydat_download(). This is a fairly lengthy operation (the download is around 1 GB) and may require several cups of coffee worth of your time.

hydat_download()

Working with HYDAT tables

The HYDAT database is a SQLite database, which can be accessed in R using the dplyr and dbplyr packages. This package has simplified the connection process, so all you have to do to connect to the database is use hy_src().

src <- hy_src()

To list the tables, use src_tbls() from the dplyr package.

src_tbls(src)
#>  [1] "AGENCY_LIST"            "ANNUAL_INSTANT_PEAKS"  
#>  [3] "ANNUAL_STATISTICS"      "CONCENTRATION_SYMBOLS" 
#>  [5] "DATA_SYMBOLS"           "DATA_TYPES"            
#>  [7] "DATUM_LIST"             "DLY_FLOWS"             
#>  [9] "DLY_LEVELS"             "MEASUREMENT_CODES"     
#> [11] "OPERATION_CODES"        "PEAK_CODES"            
#> [13] "PRECISION_CODES"        "REGIONAL_OFFICE_LIST"  
#> [15] "SAMPLE_REMARK_CODES"    "SED_DATA_TYPES"        
#> [17] "SED_DLY_LOADS"          "SED_DLY_SUSCON"        
#> [19] "SED_SAMPLES"            "SED_SAMPLES_PSD"       
#> [21] "SED_VERTICAL_LOCATION"  "SED_VERTICAL_SYMBOLS"  
#> [23] "STATIONS"               "STN_DATA_COLLECTION"   
#> [25] "STN_DATA_RANGE"         "STN_DATUM_CONVERSION"  
#> [27] "STN_DATUM_UNRELATED"    "STN_OPERATION_SCHEDULE"
#> [29] "STN_REGULATION"         "STN_REMARKS"           
#> [31] "STN_REMARK_CODES"       "STN_STATUS_CODES"      
#> [33] "VERSION"

To inspect any particular table, use the tbl() function with the src and the table name.

tbl(src, "STN_OPERATION_SCHEDULE")
#> # Source:   table<STN_OPERATION_SCHEDULE> [?? x 5]
#> # Database: sqlite 3.41.2 [/Users/samalbers/_dev/gh_repos/tidyhydat/inst/test_db/tinyhydat.sqlite3]
#>    STATION_NUMBER DATA_TYPE  YEAR MONTH_FROM MONTH_TO
#>    <chr>          <chr>     <int> <chr>      <chr>   
#>  1 05AA008        H          2012 JAN        DEC     
#>  2 05AA008        H          2013 JAN        DEC     
#>  3 05AA008        H          2014 JAN        DEC     
#>  4 05AA008        H          2015 JAN        DEC     
#>  5 05AA008        H          2016 JAN        DEC     
#>  6 05AA008        H          2017 JAN        DEC     
#>  7 05AA008        H          2018 JAN        DEC     
#>  8 05AA008        H          2019 JAN        DEC     
#>  9 05AA008        H          2020 JAN        DEC     
#> 10 05AA008        Q          1910 <NA>       <NA>    
#> # ℹ more rows

Working with SQL tables in dplyr is much like working with regular data frames, except no data is actually read from the database until necessary. Because some of these tables are large (particularly those containing the actual data), you will want to filter() the tables before you collect() them (the collect() operation loads them into memory as a data.frame).

tbl(src, "STN_OPERATION_SCHEDULE") %>%
  filter(STATION_NUMBER == "05AA008") %>%
  collect()
#> # A tibble: 103 × 5
#>    STATION_NUMBER DATA_TYPE  YEAR MONTH_FROM MONTH_TO
#>    <chr>          <chr>     <int> <chr>      <chr>   
#>  1 05AA008        H          2012 JAN        DEC     
#>  2 05AA008        H          2013 JAN        DEC     
#>  3 05AA008        H          2014 JAN        DEC     
#>  4 05AA008        H          2015 JAN        DEC     
#>  5 05AA008        H          2016 JAN        DEC     
#>  6 05AA008        H          2017 JAN        DEC     
#>  7 05AA008        H          2018 JAN        DEC     
#>  8 05AA008        H          2019 JAN        DEC     
#>  9 05AA008        H          2020 JAN        DEC     
#> 10 05AA008        Q          1910 <NA>       <NA>    
#> # ℹ 93 more rows

When you are finished with the database (i.e., the end of the script), it is good practice to close the connection (you may get a loud red warning if you don’t!).

hy_src_disconnect(src)