---
title: "Creating a cdm reference using Spark"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{a03_creating_a_cdm_reference}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  eval = FALSE
)
```

So far we've been using a local Spark connection to introduce the OmopOnSpark package. In practice, however, when working with patient-level health data our data will most likely be in the cloud-based Databricks platform, which is built around Apache Spark.

Once we have created our cdm reference, the same code we have seen when working with a local Spark dataset will also work with Databricks. It is *just* that the way we connect will differ.

To create your connection, follow the guidance at https://spark.posit.co/deployment/databricks-connect.html. Briefly, you would first save your credentials as environment variables: `usethis::edit_r_environ()` opens your `.Renviron` file, where you can add the two entries shown below.

```{r}
# opens your .Renviron file; add the two lines below to it
usethis::edit_r_environ()

DATABRICKS_HOST = "Enter here your Workspace URL"
DATABRICKS_TOKEN = "Enter here your personal token"
```

With these saved, you should now be able to connect with sparklyr, specifying your cluster ID.

```{r}
library(sparklyr)
con <- spark_connect(
  cluster_id = "Enter here your cluster ID",
  method = "databricks_connect"
)
con
```

With this, we can check that everything is working and we have an open connection.

```{r}
connection_is_open(con)
```

We should now be able to create a reference to a table. Let's say our OMOP CDM data is in a catalog called "my_catalog" and a schema called "my_omop_schema". We should be able to create a reference to our person table.

```{r}
library(dplyr)
tbl(con, I("my_catalog.my_omop_schema.person"))
```

We should be able to collect the first five rows of this table into R.

```{r}
tbl(con, I("my_catalog.my_omop_schema.person")) |>
  head(5) |>
  collect()
```

We should also be able to go in the other direction and copy data from R to a Spark DataFrame.

```{r}
spark_cars_df <- sdf_copy_to(con, cars, overwrite = TRUE)
spark_cars_df
```

If these basics are working, we should be well set up to start working with OmopOnSpark. Here we would specify our cdm schema as we've seen above. Now let's say we have another schema called "my_results_schema" where we want to save any study-specific tables; we'll use this when specifying the write schema. In addition, we can also give a write prefix, and all the tables we create while working with this cdm reference will start with this prefix.

```{r, eval=FALSE}
library(OmopOnSpark)
cdm <- cdmFromSpark(
  con,
  cdmSchema = "my_catalog.my_omop_schema",
  writeSchema = "my_catalog.my_results_schema",
  writePrefix = "study_1_"
)
cdm
```
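
As a quick check that the cdm reference behaves like the one we built against a local Spark connection, we can run the same kind of dplyr pipeline against one of its tables. This is a minimal sketch, assuming the standard `cdm$table_name` access used by omopgenerics-style cdm references.

```{r, eval=FALSE}
# count rows in the person table via the cdm reference
# (assumes cdm$person works as with other omopgenerics-style cdm references)
cdm$person |>
  summarise(n = n()) |>
  collect()
```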
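
Once we're finished, we can close the connection in the usual sparklyr way.

```{r, eval=FALSE}
# close the Databricks connection when done
spark_disconnect(con)
```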