minioclient


Rationale

There are numerous packages that already interface with the AWS S3 protocol for object storage. Most rely directly on calls to the low-level S3 REST API through R packages such as curl or httr, an approach that requires significant amounts of code to provide high-level functionality (e.g. handling authentication, paging over results, parsing returned XML) and is thus prone to inefficiency and bugs. Many also implicitly assume that Amazon is the underlying provider, making it difficult or impossible to work with the substantial and growing number of object stores that now conform to the AWS S3 standard. These include NSF’s OpenStorageNetwork, Jetstream2 (both based on open source Redhat CEPH), NCAR’s Stratus (based on Western Digital S3), and MinIO Servers (another open source implementation popular with companies and developers), as well as Google Cloud Storage’s S3 compatibility mode.

In contrast, the MinIO Client, open-source software licensed under AGPL-v3 and developed in the Go language by the MinIO team, provides a high-performance utility with an intuitive design for working across multiple cloud-based object stores as well as local filesystems. This package provides a thin R wrapper around that client, maximizing performance and minimizing the potential for maintenance burden and bugs. A helper utility provides a convenient way to install and update the golang binary across operating systems and architectures. The client supports parallel threads by default, intuitive handling of bucket permissions such as granting or revoking anonymous access, and persistent configurations across multiple clouds. After struggling against the limitations of many different R wrappers for S3 object stores, this is now my go-to.

Installation

You can install the development version of minioclient from GitHub with:

# install.packages("devtools")
devtools::install_github("cboettig/minioclient")

MinIO Client

At first use, all operations will attempt to install the client (after prompting) if it is not already installed. Users can also install the latest version of the MinIO client explicitly using install_mc().

library(minioclient)
install_mc()

The MinIO client is designed to support multiple endpoints for cloud storage, including AWS, Google Cloud Storage (via S3-compatibility), and other S3-compatible systems such as open source MinIO or Redhat CEPH storage systems. MinIO uses a syntax based around aliases to allow access across multiple platforms. Aliases can be configured using access key pairs to allow authenticated access.

Aliases

By default, the client comes pre-configured with credentials for the MinIO play platform, designed for public experimental storage and examples. We can use mc_alias_ls() to see all configured aliases, or specify the alias we want:

mc_alias_ls("play")

Some S3 object storage systems allow access without credentials. Confusingly, attempting to access public data with invalid credentials will still fail, so we need to specify an anonymous endpoint with no credentials. By default, mc_alias_set() will seek to use AWS_S3_ENDPOINT, AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY from your environment, if set. This allows minioclient to be used in scripts with authentication keys passed in securely as environment variables. To set up anonymous access, simply supply empty credentials, like so:

mc_alias_set("anon", "s3.amazonaws.com", access_key = "", secret_key = "")
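For scripted use, the environment-variable lookup described above can be made explicit. The following is a minimal sketch, not the package's internal logic: alias_from_env() is a hypothetical helper, the alias name "mycloud" is made up for illustration, and the fallback endpoint is an assumption. It only gathers the values; the call to mc_alias_set() is left commented out because it requires the MinIO client to be installed.

```r
# Hypothetical helper: collect endpoint and credentials from the environment.
alias_from_env <- function(default_endpoint = "s3.amazonaws.com") {
  cfg <- list(
    endpoint   = Sys.getenv("AWS_S3_ENDPOINT", unset = default_endpoint),
    access_key = Sys.getenv("AWS_ACCESS_KEY_ID", unset = ""),
    secret_key = Sys.getenv("AWS_SECRET_ACCESS_KEY", unset = "")
  )
  # Fail early if only one half of the key pair is present.
  if (xor(nzchar(cfg$access_key), nzchar(cfg$secret_key))) {
    stop("Set both AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY, or neither.")
  }
  cfg
}

cfg <- alias_from_env()

# With the client installed, register the alias
# (empty keys correspond to the anonymous case shown above):
# minioclient::mc_alias_set("mycloud", cfg$endpoint,
#                           access_key = cfg$access_key,
#                           secret_key = cfg$secret_key)
```

Checking for a half-configured key pair up front is a design choice for non-interactive scripts, where a clear error is usually easier to debug than a failed request later on.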

Configuration of aliases is stored in a persistent configuration file, so aliases need be created only once on a given machine. All mc functions identify the cloud provider and object using a filepath notation, <ALIAS>/<BUCKET>/<PATH>. For instance, we can list all objects found in the bucket gbif-open-data-us-east-1, which is a public bucket included in the AWS Open Data Registry:

mc_ls("anon/gbif-open-data-us-east-1")
#> [1] "index.html"  "occurrence/"
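To make the <ALIAS>/<BUCKET>/<PATH> notation concrete, here is a small base R sketch that takes such a string apart. split_mc_path() is a hypothetical helper written for illustration only; it is not part of minioclient, and the client itself does this parsing internally.

```r
# Hypothetical helper: split "<ALIAS>/<BUCKET>/<PATH>" into its three parts.
# Note: strsplit() drops a trailing "/", so "occurrence/" comes back as "occurrence".
split_mc_path <- function(path) {
  parts <- strsplit(path, "/", fixed = TRUE)[[1]]
  list(
    alias  = parts[1],
    bucket = if (length(parts) >= 2) parts[2] else NA_character_,
    key    = if (length(parts) >= 3) paste(parts[-(1:2)], collapse = "/") else ""
  )
}

p <- split_mc_path("anon/gbif-open-data-us-east-1/index.html")
p$alias   # "anon"
p$bucket  # "gbif-open-data-us-east-1"
p$key     # "index.html"
```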

All mc functions can also understand local filesystem paths. Any absolute path (a path starting with /) or any relative path not recognized as a registered alias will be interpreted as a local path. (Note: be careful not to have local folders using the same name as remote aliases!) For instance, we can list the contents of the local R/ directory:

mc_ls("R")
#>  [1] "install_mc.R"    "mc.R"            "mc_alias.R"      "mc_anonymous.R" 
#>  [5] "mc_cat.R"        "mc_config_set.R" "mc_cp.R"         "mc_diff.R"      
#>  [9] "mc_du.R"         "mc_head.R"       "mc_ls.R"         "mc_mb.R"        
#> [13] "mc_mirror.R"     "mc_mv.R"         "mc_rb.R"         "mc_rm.R"        
#> [17] "mc_sql.R"        "mc_stat.R"
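The rule for telling local paths from remote ones can be sketched in a few lines of base R. This is only an illustration of the rule as stated above, not the client's actual implementation; is_local_path() is a hypothetical helper and the alias vector is an assumed example.

```r
# Hypothetical helper: TRUE if `path` would be treated as local under the rule above.
# `aliases` stands in for the names of the registered aliases.
is_local_path <- function(path, aliases) {
  if (startsWith(path, "/")) return(TRUE)               # absolute path
  first <- strsplit(path, "/", fixed = TRUE)[[1]][1]    # first path component
  !(first %in% aliases)                                 # relative, not an alias
}

aliases <- c("play", "anon")  # assumed example aliases

is_local_path("/tmp/data.csv", aliases)                  # TRUE: absolute path
is_local_path("R", aliases)                              # TRUE: "R" is not an alias
is_local_path("anon/gbif-open-data-us-east-1", aliases)  # FALSE: starts with an alias
is_local_path("play", aliases)                           # FALSE, even if ./play exists locally
```

The last case is the name clash the note above warns about: a local folder called play would be shadowed by the alias of the same name.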

Uploads & Downloads

This notation makes it easy to move data between local and remote systems, or even between two remote systems. Let’s copy the index.html file from GBIF to our local file system.

mc_cp("anon/gbif-open-data-us-east-1/index.html", "gbif.html")

Just to prove this is indeed a local copy, we can inspect the file's metadata on the local filesystem:

fs::file_info("gbif.html")
#> # A tibble: 1 × 18
#>   path       type     size permissions modification_time   user  group device_id
#>   <fs::path> <fct> <fs::b> <fs::perms> <dttm>              <chr> <chr>     <dbl>
#> 1 gbif.html  file    31.6K rw-r--r--   2023-11-05 22:54:15 cboe… cboe…     66307
#> # ℹ 10 more variables: hard_links <dbl>, special_device_id <dbl>, inode <dbl>,
#> #   block_size <dbl>, blocks <dbl>, flags <int>, generation <dbl>,
#> #   access_time <dttm>, change_time <dttm>, birth_time <dttm>

For any object store where we have adequate permissions, we can create new buckets:

random_name <- paste0(sample(letters, 12, replace = TRUE), collapse = "")
play_bucket <- paste0("play/play-", random_name)

mc_mb(play_bucket)
#> Bucket created successfully `play/play-hmdzuvevfzdi`.

We can copy files or directories to the remote bucket:

mc_cp("anon/gbif-open-data-us-east-1/index.html", play_bucket)
mc_cp("R/", play_bucket, recursive = TRUE, verbose = TRUE)
#> `/home/cboettig/cboettig/minioclient/R/mc.R` -> `play/play-hmdzuvevfzdi/mc.R`
#> `/home/cboettig/cboettig/minioclient/R/install_mc.R` -> `play/play-hmdzuvevfzdi/install_mc.R`
#> `/home/cboettig/cboettig/minioclient/R/mc_alias.R` -> `play/play-hmdzuvevfzdi/mc_alias.R`
#> `/home/cboettig/cboettig/minioclient/R/mc_anonymous.R` -> `play/play-hmdzuvevfzdi/mc_anonymous.R`
#> `/home/cboettig/cboettig/minioclient/R/mc_config_set.R` -> `play/play-hmdzuvevfzdi/mc_config_set.R`
#> `/home/cboettig/cboettig/minioclient/R/mc_cat.R` -> `play/play-hmdzuvevfzdi/mc_cat.R`
#> `/home/cboettig/cboettig/minioclient/R/mc_cp.R` -> `play/play-hmdzuvevfzdi/mc_cp.R`
#> `/home/cboettig/cboettig/minioclient/R/mc_diff.R` -> `play/play-hmdzuvevfzdi/mc_diff.R`
#> `/home/cboettig/cboettig/minioclient/R/mc_du.R` -> `play/play-hmdzuvevfzdi/mc_du.R`
#> `/home/cboettig/cboettig/minioclient/R/mc_head.R` -> `play/play-hmdzuvevfzdi/mc_head.R`
#> `/home/cboettig/cboettig/minioclient/R/mc_ls.R` -> `play/play-hmdzuvevfzdi/mc_ls.R`
#> `/home/cboettig/cboettig/minioclient/R/mc_mb.R` -> `play/play-hmdzuvevfzdi/mc_mb.R`
#> `/home/cboettig/cboettig/minioclient/R/mc_mirror.R` -> `play/play-hmdzuvevfzdi/mc_mirror.R`
#> `/home/cboettig/cboettig/minioclient/R/mc_mv.R` -> `play/play-hmdzuvevfzdi/mc_mv.R`
#> `/home/cboettig/cboettig/minioclient/R/mc_rb.R` -> `play/play-hmdzuvevfzdi/mc_rb.R`
#> `/home/cboettig/cboettig/minioclient/R/mc_rm.R` -> `play/play-hmdzuvevfzdi/mc_rm.R`
#> `/home/cboettig/cboettig/minioclient/R/mc_sql.R` -> `play/play-hmdzuvevfzdi/mc_sql.R`
#> `/home/cboettig/cboettig/minioclient/R/mc_stat.R` -> `play/play-hmdzuvevfzdi/mc_stat.R`
#> Total: 0 B, Transferred: 22.00 KiB, Speed: 314.03 KiB/s

Note the use of recursive = TRUE to transfer all objects matching the pattern. In S3 object stores, file paths are really just prefixes. Because we used the prefix R/ here, only the contents of the R folder are matched, and the R scripts go directly into the root of play_bucket instead of an R/ sub-path. (Had we used the prefix R without the trailing slash, the query would have included not only everything in the R folder but also README.md, since it also matches the prefix.)
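The difference between the two prefixes can be illustrated with plain string matching in base R. This is a toy model with made-up keys, not a call to the client; it only shows why a prefix without the trailing slash matches more than the folder.

```r
# Made-up object keys standing in for a project directory.
keys <- c("R/mc.R", "R/mc_ls.R", "README.md", "DESCRIPTION")

with_slash    <- keys[startsWith(keys, "R/")]  # only objects under R/
without_slash <- keys[startsWith(keys, "R")]   # also picks up README.md

with_slash
without_slash
```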

We can examine disk usage of remote objects or directories:

mc_du(play_bucket)

We can also adjust permissions for anonymous access:

mc_anonymous_set(play_bucket, "download")

Public objects can be accessed directly over an HTTPS connection using the endpoint URL, bucket name and path:

bucket <-  basename(play_bucket) # strip alias from path
# use full domain name as prefix instead:
public_url <- paste0("https://play.min.io/", bucket, "/index.html")
download.file(public_url, "index.html", quiet = TRUE)
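The same construction can be wrapped in a small function that also works for objects nested below the bucket root, where basename() alone would drop the bucket name. This is a sketch under stated assumptions: public_url_for() is a hypothetical helper, the endpoint domain must be supplied by the caller, and keys containing spaces or other special characters would additionally need URL-encoding.

```r
# Hypothetical helper: turn "<ALIAS>/<BUCKET>/<PATH>" into a public URL,
# given the endpoint's domain name (e.g. "play.min.io" for the play alias).
public_url_for <- function(path, endpoint) {
  object <- sub("^[^/]+/", "", path)          # drop the leading alias component
  paste0("https://", endpoint, "/", object)
}

public_url_for("play/play-hmdzuvevfzdi/index.html", "play.min.io")
# "https://play.min.io/play-hmdzuvevfzdi/index.html"
```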

Additional functionality

Any command supported by the MinIO client can be accessed using the function mc(). This function can be used in place of any of the above methods, or to access additional methods for which no wrapper exists; see mc("-h") for a complete list. R functions such as mc_ls() are merely helpful wrappers around the more generic mc() utility, e.g. mc("ls play") is equivalent to mc_ls("play"). Providing helper methods allows tab-completion discovery of functions, R-based documentation, and improved handling of display behavior (e.g. verbose=FALSE by default on certain commands). See the official mc client docs for details.

In addition to the usual R documentation, users can display full help information for any method using the argument "-h". This includes details on optional flags and further examples.

mc_du("-h")

We can now use arbitrary mc commands (see quickstart). For example, examine file information to confirm that eTags (md5sums here) match for these objects:

mc(paste("stat", "anon/gbif-open-data-us-east-1/index.html", paste0(play_bucket, "/index.html")))
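Since mc() takes the whole command as a single string, longer commands can be easier to read when assembled from their parts. The following is a minimal sketch: build_mc_command() is a hypothetical helper, not part of minioclient, and it assumes paths without spaces (no quoting is applied). The bucket name is the example one from the output above.

```r
# Hypothetical helper: paste a verb, optional flags and paths into one command string.
build_mc_command <- function(verb, ..., flags = character()) {
  paste(c(verb, flags, ...), collapse = " ")
}

cmd <- build_mc_command(
  "stat",
  "anon/gbif-open-data-us-east-1/index.html",
  "play/play-hmdzuvevfzdi/index.html"
)
cmd

# Flags go between the verb and the paths:
build_mc_command("ls", "play", flags = "--recursive")

# With the client installed, the string would be passed on as:
# minioclient::mc(cmd)
```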