--- title: "googleCloudStorageR" author: "Mark Edmondson" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{googleCloudStorageR} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- An R library for interacting with the Google Cloud Storage JSON API ([api docs](https://cloud.google.com/storage/docs/json_api/)). ## Setup Google Cloud Storage charges you for storage [(prices here)](https://cloud.google.com/storage/pricing). You can use your own Google Project with a credit card added to create buckets, where the charges will apply. This can be done in the [Google API Console](https://console.developers.google.com) ## Configuring your own Google Project You will first need to [create a Google Cloud Project](https://cloud.google.com/resource-manager/docs/creating-managing-projects) and make sure that the Cloud Storage API is turned on. It is by default for new projects. The recommended way is to use `gcs_setup()` which will help you create and download an authentication JSON key, and set your default bucket. ```r library(googleCloudStorageR) gcs_setup() #ℹ ==Welcome to googleCloudStorageR v0.6.0 setup== #This wizard will scan your system for setup options and help you with any that are missing. #Hit 0 or ESC to cancel. # #1: Create and download JSON service account key #2: Setup auto-authentication (JSON service account key) #3: Setup default bucket # #Selection: | ``` It uses `googleAuthR::gar_setup_menu()` to create the wizard. You will need to have owner access to the project you are using. After each menu option has completed, restart R and rerun `gcs_setup()` function to continue to the next step. Upon successful set-up, you should see a message similar to below: ```r library(googleCloudStorageR) #Setting scopes to https://www.googleapis.com/auth/devstorage.full_control and https://www.googleapis.com/auth/cloud-platform #Successfully auto-authenticated via /Users/xxxx/googlecloudstorager-auth-key.json #Set default bucket name to 'xxxxxx' ``` ## Manual setup `gcs_setup()` works through the steps detailed below. The instructions below are for when you visit the Google API console (`https://console.developers.google.com/apis/`) ### Activate API 1. Click on "APIs" 2. Select and activate the Cloud Storage JSON API if not already active ### Set environment variables By default, all `cloudyr` packages look for the access key ID and secret access key in environment variables. You can also use this to specify a default bucket, and auto-authentication upon attaching the library. For example: ```r Sys.setenv("GCS_DEFAULT_BUCKET" = "my-default-bucket", "GCS_AUTH_FILE" = "/fullpath/to/service-auth.json") ``` These can alternatively be set on the command line or via an Renviron.site or .Renviron file (`https://cran.r-project.org/web/packages/httr/vignettes/api-packages.html`). e.g. In your `.Renviron`: ``` GCS_AUTH_FILE="/fullpath/to/service-auth.json" GCS_DEFAULT_BUCKET=my-default-bucket ``` ### Auto-authentication The best method for authentication is to use your own Google Cloud Project. You can specify the location of a service account JSON file taken from your Google Project: Sys.setenv("GCS_AUTH_FILE" = "/fullpath/to/auth.json") This file will then used for authentication via `gcs_auth()` when you load the library: ```r ## GCS_AUTH_FILE set so auto-authentication library(googleCloudStorageR) gcs_get_bucket("your-bucket") ``` ### Token-authentication You can also seamlessly authenticate as the service account your compute resource (i.e. 
### Token-authentication

You can also seamlessly authenticate as the service account your compute resource (i.e. Cloud Function, AI Notebook, etc.) is using, by requesting a token and then passing that token to `gcs_auth()` with the [gargle](https://gargle.r-lib.org/) library:

```r
## Load googleCloudStorageR and gargle
library(googleCloudStorageR)
library(gargle)

## Fetch token. See: https://developers.google.com/identity/protocols/oauth2/scopes
scope <- c("https://www.googleapis.com/auth/cloud-platform")
token <- token_fetch(scopes = scope)

## Pass your token to gcs_auth
gcs_auth(token = token)

## Perform gcs operations as normal
gcs_list_objects(bucket = "my-bucket")
```

### Setting a default Bucket

To avoid specifying the bucket in the functions below, you can set the name of your default bucket via environment variables or via the function `gcs_global_bucket()`. See the `Set environment variables` section for more details.

```r
## set bucket via environment variable
Sys.setenv("GCS_DEFAULT_BUCKET" = "my-default-bucket")

library(googleCloudStorageR)

## check what the default bucket is
gcs_get_global_bucket()
[1] "my-default-bucket"

## you can also set a default bucket after loading the library for that session
gcs_global_bucket("my-default-bucket-2")
gcs_get_global_bucket()
[1] "my-default-bucket-2"
```

## Downloading objects from Google Cloud Storage

Once you have a Google project and have created a bucket with an object in it, you can download it as below:

```r
library(googleCloudStorageR)

## get your project name from the API console
proj <- "your-project"

## get bucket info
buckets <- gcs_list_buckets(proj)
bucket <- "your-bucket"
bucket_info <- gcs_get_bucket(bucket)
bucket_info
==Google Cloud Storage Bucket==
Bucket:          your-bucket
Project Number:  1123123123
Location:        EU
Class:           STANDARD
Created:         2016-04-28 11:39:06
Updated:         2016-04-28 11:39:06
Meta-generation: 1
eTag:            Cxx=

## get object info in the default bucket
objects <- gcs_list_objects()

## save directly to an R object (warning, don't run out of RAM if it's a big object)
## the download type is guessed into an appropriate R object
parsed_download <- gcs_get_object(objects$name[[1]])

## if you want to do your own parsing, set parseObject to FALSE
## use httr::content() to parse afterwards
raw_download <- gcs_get_object(objects$name[[1]], parseObject = FALSE)

## save directly to a file in your working directory
## parseObject has no effect, it is a httr::content(req, "raw") download
gcs_get_object(objects$name[[1]], saveToDisk = "csv_downloaded.csv")
```
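If you need more than one object, the same two functions can be combined. A minimal sketch, assuming the objects are small enough to write straight to disk and their names are valid local file names:

```r
## download every object listed in the default bucket to the working directory
objs <- gcs_list_objects()

for (object_name in objs$name) {
  gcs_get_object(object_name, saveToDisk = basename(object_name))
}
```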
## Uploading objects - simple uploads

Objects can be uploaded via files saved to disk, or passed in directly if they are data frames or list-type R objects. By default, data frames are converted to CSV via `write.csv()` and lists to JSON via `jsonlite::toJSON()`.

If you want to use other functions for transforming R objects, for example setting `row.names = FALSE` or using `write.csv2()`, pass the function in via `object_function`.

```r
## upload a file - type will be guessed from the file extension, or supply type
filename <- "mtcars.csv"
write.csv(mtcars, file = filename)
gcs_upload(filename)

## upload an R data.frame directly - will be converted to csv via write.csv
gcs_upload(mtcars)

## upload an R list - will be converted to json via jsonlite::toJSON
gcs_upload(list(a = 1, b = 3, c = list(d = 2, e = 5)))

## upload an R data.frame directly, with a custom function
## the function should have arguments 'input' and 'output'
## it is safest to supply the type too
f <- function(input, output) write.csv(input, row.names = FALSE, file = output)

gcs_upload(mtcars,
           object_function = f,
           type = "text/csv")
```
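The same `object_function` mechanism works with any writer that takes an input object and an output path. For instance, a sketch of uploading an object as a compressed `.rds` file rather than CSV (the object name and MIME type here are just illustrative choices):

```r
## custom serialiser: arguments must be named 'input' and 'output'
to_rds <- function(input, output) saveRDS(input, file = output)

gcs_upload(mtcars,
           object_function = to_rds,
           name = "mtcars.rds",
           type = "application/octet-stream")
```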
Since 2019 you can also set bucket-level access permissions. To upload to those buckets, specify `predefinedAcl = "bucketLevel"`:

```r
gcs_upload(mtcars,
           bucket = "mark-bucketlevel-acl",
           predefinedAcl = "bucketLevel")
```

## Upload metadata

You can pass metadata with an object via the function `gcs_metadata_object()`. The name you pass to the metadata object will override the object's name if it is also set elsewhere.

```r
meta <- gcs_metadata_object("mtcars.csv",
                            metadata = list(custom1 = 2,
                                            custom_key = 'dfsdfsdfsfs'))

gcs_upload(mtcars, object_metadata = meta)
```

## Resumable uploads for files > 5MB up to 5TB

If the file/object is smaller than the upload limit, a simple upload is used. You can modify this limit using `options(googleCloudStorageR.upload_limit = ...)` or `gcs_upload_set_limit()` - the default is 5000000L bytes, or 5MB (#120).

For files greater than the upload limit, [resumable uploads](https://cloud.google.com/storage/docs/uploads-downloads) are used. This allows you to upload up to 5TB.

If the connection is interrupted while uploading, `gcs_upload()` will retry 3 times. If it still fails, it returns a Retry object that you can use later to resume the upload from where it stopped, via `gcs_retry_upload()`.

```r
## write a big object to a file
big_file <- "big_filename.csv"
write.csv(big_object, file = big_file)

## attempt upload
upload_try <- gcs_upload(big_file)

## if successful, upload_try is an object metadata object
upload_try
==Google Cloud Storage Object==
Name:            big_filename.csv
Size:            8.5 Gb
Media URL:       https://www.googleapis.com/download/storage/v1/b/xxxx
Bucket:          your-bucket
ID:              your-bucket/big_filename.csv/xxxx
MD5 Hash:        rshao1nxxxxxY68JZQ==
Class:           STANDARD
Created:         2016-08-12 17:33:05
Updated:         2016-08-12 17:33:05
Generation:      1471023185977000
Meta Generation: 1
eTag:            CKi90xxxxxEAE=
crc32c:          j4i1sQ==

## if unsuccessful after 3 retries, upload_try is a Retry object
==Google Cloud Storage Upload Retry Object==
File Location:    big_filename.csv
Retry Upload URL: http://xxxx
Created:          2016-08-12 17:33:05
Type:             csv
File Size:        8.5 Gb
Upload Byte:      4343
Upload remaining: 8.1 Gb

## you can retry the upload of the remaining data using gcs_retry_upload()
try2 <- gcs_retry_upload(upload_try)
```

## Updating user access to objects

You can change who can access objects via `gcs_update_object_acl()`, setting the role to `READER` or `OWNER` for a user, group, domain, project, or for the public (all users or all authenticated users).

By default you are `OWNER` of all the objects and buckets you upload and create.

```r
## update access of object to READER for all public users
gcs_update_object_acl("your-object.csv", entity_type = "allUsers")

## update access of object for user joe@blogs.com to OWNER
gcs_update_object_acl("your-object.csv", entity = "joe@blogs.com", role = "OWNER")

## update access of object for Google Group users to READER
gcs_update_object_acl("your-object.csv",
                      entity = "my-group@googlegroups.com",
                      entity_type = "group")

## update access of object for all users on your Google Apps domain to OWNER
gcs_update_object_acl("your-object.csv",
                      entity = "yourdomain.com",
                      entity_type = "domain",
                      role = "OWNER")
```

## Deleting an object

Delete an object by passing its name (and bucket, if not using the default bucket).

```r
## returns TRUE if successful, a 404 error if not found
gcs_delete_object("your-object.csv")
```

### Viewing current access level to objects

Use `gcs_get_object_acl()` to see what the current access is for an `entity` + `entity_type`.

```r
## default entity_type is user
acl <- gcs_get_object_acl("your-object.csv", entity = "joe@blogs.com")
acl$role
[1] "OWNER"

## for allUsers and allAuthenticatedUsers, you don't need to supply entity
acl <- gcs_get_object_acl("your-object.csv", entity_type = "allUsers")
acl$role
[1] "READER"
```

## Creating download links

Once a user (or group, or the public) has access, they can reach that object via a download link generated by `gcs_download_url()`.

```r
download_url <- gcs_download_url("your-object.csv")
download_url
[1] "https://storage.cloud.google.com/your-bucket/your-object.csv"
```
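Putting the last two sections together, a short sketch of publishing an object and grabbing a shareable link (this assumes the bucket uses object-level ACLs rather than bucket-level access control):

```r
## make the object readable by anyone, then generate a link to share
gcs_update_object_acl("your-object.csv", entity_type = "allUsers")

public_link <- gcs_download_url("your-object.csv")
public_link
```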
## R Session helpers

Versions of `save.image()`, `save()` and `load()` are implemented as `gcs_save_image()`, `gcs_save()` and `gcs_load()`. These functions save and load R sessions and objects to and from the cloud.

```r
## save the current R session, including all objects
gcs_save_image()

### wipe the environment
rm(list = ls())

## load up the environment again
gcs_load()
```

Save specific objects:

```r
cc <- 3
d <- "test1"

gcs_save("cc", "d", file = "gcs_save_test.RData")

## remove the objects saved in the cloud from the local environment
rm(cc, d)

## load them back in from GCS
gcs_load(file = "gcs_save_test.RData")

cc == 3
[1] TRUE
d == "test1"
[1] TRUE
```

You can also upload `.R` code files and source them directly using `gcs_source()`:

```r
## make an R source file and upload it
cat("x <- 'hello world!'\nx", file = "example.R")
gcs_upload("example.R", name = "example.R")

## source the file to run its code
gcs_source("example.R")

## the code from the uploaded file has run
x
[1] "hello world!"
```

## Uploading via a Shiny app

The library is also compatible with Shiny authentication flows, so you can create Shiny apps that let users log in and upload their own data. An example is shown below:

```r
library("shiny")
library("googleAuthR")
library("googleCloudStorageR")

## you need to start the Shiny app on port 1221,
## as that's what the default googleAuthR project expects for OAuth2 authentication
options(shiny.port = 1221)

## run via print(source('shiny_test.R')$value) or push the "Run App" button in RStudio
shinyApp(
  ui = shinyUI(
    fluidPage(
      googleAuthR::googleAuthUI("login"),
      fileInput("picture", "picture"),
      textInput("filename", label = "Name on Google Cloud Storage", value = "myObject"),
      actionButton("submit", "submit"),
      textOutput("meta_file")
    )
  ),
  server = shinyServer(function(input, output, session){

    access_token <- shiny::callModule(googleAuth, "login")

    meta <- eventReactive(input$submit, {

      message("Uploading to Google Cloud Storage")

      # from googleCloudStorageR
      with_shiny(gcs_upload,
                 file = input$picture$datapath,
                 # enter your bucket name here
                 bucket = "gogauth-test",
                 type = input$picture$type,
                 name = input$filename,
                 shiny_access_token = access_token())

    })

    output$meta_file <- renderText({

      req(meta())

      str(meta())

      paste("Uploaded: ", meta()$name)

    })

  })
)
```

## Bucket administration

There are various functions for manipulating buckets:

* `gcs_list_buckets`
* `gcs_get_bucket`
* `gcs_create_bucket`

## Object administration

You can get metadata about an object by passing `meta = TRUE` to `gcs_get_object()`:

```r
gcs_get_object("your-object", "your-bucket", meta = TRUE)
```

## Explanation of Google Project access

`googleCloudStorageR` has its own Google project, which is used to call the Google Cloud Storage API. It does not have access to the objects or buckets in your Google Project unless you give the library permission to access your own buckets during the OAuth2 authentication process. No other user, including the owner of the Google Cloud Storage API project, has access unless you have granted it, but you may want to switch to using your own Google Project (which may or may not be the same project that holds your buckets).
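To recap the main workflow covered in this vignette, here is a minimal end-to-end sketch using only the functions shown above (bucket and object names are placeholders, and it assumes `GCS_AUTH_FILE` is set so the library auto-authenticates on load):

```r
library(googleCloudStorageR)   # auto-authenticates if GCS_AUTH_FILE is set

## work against one bucket for the session
gcs_global_bucket("my-default-bucket")

## upload a data.frame as a CSV object
gcs_upload(mtcars, name = "mtcars.csv")

## list the bucket and read the object back into R
gcs_list_objects()
df <- gcs_get_object("mtcars.csv")

## generate a download link, then tidy up
gcs_download_url("mtcars.csv")
gcs_delete_object("mtcars.csv")
```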