--- title: "Getting Started with rsynthbio" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Getting Started with synthesizeR} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", eval = FALSE # Set to FALSE since API calls require credentials ) ``` # rsynthbio `rsynthbio` is an R package that provides a convenient interface to the Synthesize Bio API, allowing users to generate realistic gene expression data based on specified biological conditions. This package enables researchers to easily access AI-generated transcriptomic data for various modalities including bulk RNA-seq, single-cell RNA-seq, microarray data, and more. ## How to install You can install `rsynthbio` from CRAN: ```{r installation, eval=FALSE} install.packages("rsynthbio") ``` If you want the development version, you can install using the `remotes` package to install from GitHub: ```{r github-installation, eval=FALSE} if (!("remotes" %in% installed.packages())) { install.packages("remotes") } remotes::install_github("synthesizebio/rsynthbio") ``` Once installed, load the package: ```{r} library(rsynthbio) ``` ## Authentication Before using the Synthesize Bio API, you need to set up your API token. The package provides a secure way to handle authentication: ```{r auth-secure, eval=FALSE} # Securely prompt for and store your API token # The token will not be visible in the console set_synthesize_token() # You can also store the token in your system keyring for persistence # across R sessions (requires the 'keyring' package) set_synthesize_token(use_keyring = TRUE) ``` Loading your API key for a session. ```{r, eval=FALSE} # In future sessions, load the stored token load_synthesize_token_from_keyring() # Check if a token is already set has_synthesize_token() ``` You can obtain an API token by registering at [Synthesize Bio](https://app.synthesize.bio). ### Security Best Practices For security reasons, remember to clear your token when you're done: ```{r clear-token, eval = FALSE} # Clear token from current session clear_synthesize_token() # Clear token from both session and keyring clear_synthesize_token(remove_from_keyring = TRUE) ``` Never hard-code your token in scripts that will be shared or committed to version control. ## Basic Usage ### Available Modalities Some Synthesize models support generation of different gene expression data types. In the v2 model, you should use "bulk" for bulk gene expression. ```{r modalities} # Check available modalities get_valid_modalities() ``` ### Creating a Query The first step to generating AI-generated gene expression data is to create a query. The package provides a sample query that you can modify: ```{r query} # Get a sample query query <- get_valid_query() # Inspect the query structure str(query) ``` The query consists of: 1. `output_modality`: The type of gene expression data to generate (see `get_valid_modalities`) 2. `mode`: The prediction mode (e.g., "mean estimation" or "sample generation") 3. `inputs`: A list of biological conditions to generate data for We train our models with diverse multi-omics datasets. There are two model types/modes available today: + Sample generation: This runs in "diffusion" mode and generates different results for each sample requested. Use this mode to understand the distribution of expression across sample groups. + Mean estimation: This is deterministic. For a given metadata specification, you will get the same values. ```{r predict, eval=FALSE} # Request raw counts data result <- predict_query(query) ``` This result will be a list of two dataframes: `metadata` and `expression` ### Modifying a Query You can customize the query to fit your specific research needs: ```{r modify-query} # Change output modality query$output_modality <- "single_cell_rna-seq" # Adjust number of samples query$inputs[[1]]$num_samples <- 10 # Modify cell line information query$inputs[[1]]$metadata$cell_line <- "MCF7" query$inputs[[1]]$metadata$perturbation <- "TP53" # Add a new condition query$inputs[[3]] <- list( metadata = list( tissue = "lung", disease = "adenocarcinoma", sex = "male", age = "57 years", sample_type = "primary tissue" ), num_samples = 3 ) ``` ### Making Predictions Once your query is ready, you can send it to the API to generate gene expression data. ```{r predict-2, eval=FALSE} # Request raw counts data result <- predict_query(query, as_counts = TRUE) ``` If you want the full API response beyond just than just the result of the metadata and expression returned put `raw_response = TRUE`. ### Working with Results ```{r analyze, eval=FALSE} # Access metadata and expression matrices metadata <- result$metadata expression <- result$expression # Check dimensions dim(expression) # View metadata sample head(metadata) ``` You may want to process the data in chunks or save it for later use: ```{r large-data, eval=FALSE} # Save results to RDS file saveRDS(result, "synthesize_results.rds") # Load previously saved results result <- readRDS("synthesize_results.rds") # Export as CSV write.csv(result$expression, "expression_matrix.csv") write.csv(result$metadata, "sample_metadata.csv") ``` ### Custom Validation You can validate your queries before sending them to the API: ```{r validation} # Validate structure validate_query(query) # Validate modality validate_modality(query) ``` ## Session info ```{r session-info} sessionInfo() ``` ## Additional Resources - [Package Source Code](https://github.com/synthesizebio/rsynthbio) - [File Bug Reports](https://github.com/synthesizebio/rsynthbio/issues)