---
title: "Cluster Management"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Cluster Management}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  eval = FALSE
)
```

`{brickster}` has 1:1 mappings with the clusters REST API, enabling full control of Databricks clusters from your R session.

## Cluster Creation

Clusters have a number of parameters and can be configured to match the needs of a given workload. `db_cluster_create()` facilitates creation of a cluster in a Databricks workspace on any of the supported cloud platforms (AWS, Azure, GCP). Depending on the cloud you will need to change the node types and set `cloud_attrs` to one of: `aws_attributes()`, `azure_attributes()`, or `gcp_attributes()`.

Below we will create a cluster on AWS and then step through the other supporting functions.

```{r setup}
library(brickster)

# create a small cluster on AWS with DBR 9.1 LTS
new_cluster <- db_cluster_create(
  name = "brickster-cluster",
  spark_version = "9.1.x-scala2.12",
  num_workers = 2,
  node_type_id = "m5a.xlarge",
  cloud_attrs = aws_attributes(
    ebs_volume_count = 3,
    ebs_volume_size = 100
  )
)
```

```{r, echo=FALSE, results='hide'}
temp <- get_and_start_cluster(cluster_id = new_cluster$cluster_id)
```

Refer to the documentation for details on how to use the parameters not mentioned here (e.g. `spark_conf`).

Before creating a cluster you may want to check the supported values for a number of the parameters. There are functions to assist with this:

| Function | Purpose |
|----------------:|-------------------------------------------------------|
| `db_cluster_runtime_versions()` | List runtime versions available in the workspace, useful for finding a relevant `spark_version` |
| `db_cluster_list_node_types()` | List node types supported in the workspace/region, useful for finding a relevant `node_type_id`/`driver_node_type_id` |
| `db_cluster_list_zones()` | AWS only, list the availability zones (AZs) clusters can occupy |

`db_cluster_get()` provides details for the cluster we just created, including information such as its state. This can be useful as you may wish to wait for the cluster to be `RUNNING`, which is exactly what `get_and_start_cluster()` does internally, waiting until the cluster is running before returning.

```{r}
cluster_info <- db_cluster_get(cluster_id = new_cluster$cluster_id)
cluster_info$state
```

## Editing Clusters

You can edit Databricks clusters to change various parameters using `db_cluster_edit()`. For example, we may decide we want our cluster to autoscale between 2 and 8 workers and add some tags.

```{r, results='hide'}
# we are required to input all parameters
db_cluster_edit(
  cluster_id = new_cluster$cluster_id,
  name = "brickster-cluster",
  spark_version = "9.1.x-scala2.12",
  node_type_id = "m5a.xlarge",
  autoscale = cluster_autoscale(min_workers = 2, max_workers = 8),
  cloud_attrs = aws_attributes(
    ebs_volume_count = 3,
    ebs_volume_size = 100
  ),
  custom_tags = list(
    purpose = "brickster_cluster_demo"
  )
)
```

However, if the intention is only to change the size of a given cluster, `db_cluster_resize()` is a simpler alternative. You can either adjust the number of workers or change the autoscale range, as shown in the examples below. If the autoscale range is adjusted and the number of active workers falls outside the new bounds, the cluster will be resized to fit within them.
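As a minimal sketch of the fixed-size variant, pass `num_workers` instead of `autoscale` (the value `4` here is arbitrary):

```{r, results='hide'}
# resize the cluster to a fixed count of 4 workers
db_cluster_resize(
  cluster_id = new_cluster$cluster_id,
  num_workers = 4
)
```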
Alternatively, pass `autoscale` to change the autoscaling range:

```{r, results='hide'}
# adjust the autoscale range to be between 4 and 6 workers
db_cluster_resize(
  cluster_id = new_cluster$cluster_id,
  autoscale = cluster_autoscale(min_workers = 4, max_workers = 6)
)
```

It's important to note that specifying `num_workers` instead of `autoscale` on a cluster that has an existing autoscale range will switch the cluster to a fixed number of workers from that point onward.

Databricks clusters can be "pinned", which prevents them from being permanently removed 30 days after termination. `db_cluster_pin()` and `db_cluster_unpin()` control whether a cluster is pinned.

```{r, results='hide'}
# pin the cluster
db_cluster_pin(cluster_id = new_cluster$cluster_id)

# unpin the cluster
# db_cluster_unpin(cluster_id = new_cluster$cluster_id)
```

## Cluster State

There are a few functions that can be used to manage the state of an existing cluster:

| Function | Purpose |
|-----------------:|------------------------------------------------------|
| `db_cluster_start()` | Start a cluster that is inactive |
| `db_cluster_restart()` | Restart a cluster, the cluster must already be running |
| `db_cluster_delete()`/`db_cluster_terminate()` | Terminate an active cluster, does not remove the cluster configuration from Databricks |
| `db_cluster_perm_delete()` | Stop (if active) and permanently delete a cluster, it will no longer appear in Databricks |

## Cluster Libraries

Databricks clusters can have libraries installed from a number of sources using `db_libs_install()` and the associated `lib_*()` functions:

| Function | Library Source |
|--------------:|---------------------|
| `lib_cran()` | CRAN |
| `lib_pypi()` | PyPI |
| `lib_egg()` | Python egg (file) |
| `lib_whl()` | Python wheel (file) |
| `lib_maven()` | Maven |
| `lib_jar()` | JAR (file) |

```{r, results='hide'}
# install packages from CRAN on the cluster
db_libs_install(
  cluster_id = new_cluster$cluster_id,
  libraries = libraries(
    lib_cran(package = "palmerpenguins"),
    lib_cran(package = "dplyr")
  )
)
```

Installation of libraries is asynchronous and completes in the background. For convenience, `wait_for_lib_installs()` will block until all libraries on the specified cluster have finished installing.

```{r, results='hide'}
wait_for_lib_installs(cluster_id = new_cluster$cluster_id)
```

`db_libs_cluster_status()` checks the installation status of libraries on a given cluster; `db_libs_all_cluster_statuses()` gets the status of all libraries across all clusters in the workspace.

```{r}
db_libs_cluster_status(cluster_id = new_cluster$cluster_id)
```

Libraries can be uninstalled using `db_libs_uninstall()`.

```{r, results='hide'}
db_libs_uninstall(
  cluster_id = new_cluster$cluster_id,
  libraries = libraries(
    lib_cran(package = "palmerpenguins")
  )
)
```

`db_libs_cluster_status()` now shows that the library will only be uninstalled when the cluster restarts (e.g. via `db_cluster_restart()`).

```{r}
db_libs_cluster_status(cluster_id = new_cluster$cluster_id)
```

## Events

A list of events regarding the cluster's activity can be fetched via `db_cluster_events()`. There are many [event types](https://docs.databricks.com/dev-tools/api/latest/clusters.html#clustereventtype) that can occur, and by default the 50 most recent events are returned.

```{r}
events <- db_cluster_events(cluster_id = new_cluster$cluster_id)
head(events, 1)
```
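Given the package's 1:1 mapping with the clusters REST API, the endpoint's filters should also be available. As a hedged sketch, assuming `db_cluster_events()` exposes the endpoint's `event_types` and `limit` parameters (not shown above), you could fetch only the most recent termination events:

```{r}
# assumption: db_cluster_events() mirrors the REST endpoint's
# `event_types` and `limit` filters
termination_events <- db_cluster_events(
  cluster_id = new_cluster$cluster_id,
  event_types = list("TERMINATING"),
  limit = 10
)
```

```{r, echo=FALSE, results='hide'}
db_cluster_unpin(cluster_id = new_cluster$cluster_id)
db_cluster_perm_delete(cluster_id = new_cluster$cluster_id)
```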