--- title: "Getting Started with mLLMCelltype" author: "Chen Yang" date: "`r Sys.Date()`" output: html_document: toc: true toc_float: true toc_depth: 3 theme: flatly vignette: > %\VignetteIndexEntry{Getting Started with mLLMCelltype} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include=FALSE} knitr::opts_chunk$set( echo = TRUE, message = FALSE, warning = FALSE, eval = FALSE ) ```

# Getting Started with mLLMCelltype This guide provides a quick introduction to using mLLMCelltype for cell type annotation in single-cell RNA sequencing data. We'll cover the basic workflow, input data requirements, and a simple example to get you started. ## Basic Workflow The mLLMCelltype workflow consists of these main steps: 1. **Prepare marker gene data** for each cluster 2. **Run annotation** using one or multiple LLMs 3. **Create consensus** from multiple model predictions (optional) 4. **Integrate results** with your Seurat or Scanpy object 5. **Visualize results** with uncertainty metrics ## Loading the Package and Setting Up API Keys First, load the mLLMCelltype package: ```{r} library(mLLMCelltype) ``` ### Setting Up API Keys Before using mLLMCelltype, you need to set up API keys for the LLM providers you plan to use: ```{r} # Set API keys as environment variables Sys.setenv(ANTHROPIC_API_KEY = "your-anthropic-api-key") # For Claude models Sys.setenv(OPENAI_API_KEY = "your-openai-api-key") # For GPT models Sys.setenv(GEMINI_API_KEY = "your-gemini-api-key") # For Gemini models Sys.setenv(OPENROUTER_API_KEY = "your-openrouter-api-key") # For OpenRouter models ``` You can obtain API keys from: - Anthropic: https://console.anthropic.com/ - OpenAI: https://platform.openai.com/ - Google (Gemini): https://ai.google.dev/ - OpenRouter: https://openrouter.ai/keys Alternatively, you can provide API keys directly in function calls: ```{r} results <- annotate_cell_types( input = markers, tissue_name = "human PBMC", model = "claude-3-7-sonnet-20250219", api_key = "your-anthropic-api-key", # Direct API key top_gene_count = 10 ) ``` ## Input Data Requirements mLLMCelltype accepts marker gene data in several formats: ### 1. Data Frame Format A data frame with the following columns: - `cluster`: Cluster ID (must be 0-based) - `gene`: Gene name/symbol - `avg_log2FC` or similar metric: Log fold change - `p_val_adj` or similar metric: Adjusted p-value Example: ```{r} # Example marker data frame markers_df <- data.frame( cluster = c(0, 0, 0, 1, 1, 1), gene = c("CD3D", "CD3E", "CD2", "CD14", "LYZ", "CST3"), avg_log2FC = c(2.5, 2.3, 2.1, 3.1, 2.8, 2.5), p_val_adj = c(0.001, 0.001, 0.002, 0.0001, 0.0002, 0.0005) ) ``` ### 2. Seurat FindMarkers Output You can directly use the output from Seurat's `FindAllMarkers()` function: ```{r} # Assuming you have a Seurat object named 'seurat_obj' library(Seurat) all_markers <- FindAllMarkers(seurat_obj, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25) ``` ### 3. CSV File Path A path to a CSV file containing marker gene data: ```{r} # Path to your CSV file markers_file <- "path/to/markers.csv" ``` ### 4. List Format A list where each element contains marker genes for a cluster: ```{r} # Example marker list markers_list <- list( "0" = c("CD3D", "CD3E", "CD2", "IL7R", "LTB"), "1" = c("CD14", "LYZ", "CST3", "MS4A7", "FCGR3A") ) ``` ## Function Parameters The `annotate_cell_types` function has the following parameters: | Parameter | Description | Default Value | |-----------|-------------|---------------| | `input` | Marker gene data (data frame, list, or file path) | (required) | | `tissue_name` | Tissue name (e.g., "human PBMC", "mouse brain") | `NULL` | | `model` | LLM model to use | `"gpt-4o"` | | `api_key` | API key (if not set in environment) | `NA` | | `top_gene_count` | Number of top genes per cluster to use | `10` | | `debug` | Whether to print debugging information | `FALSE` | Note: If `api_key` is set to `NA`, the function will return the generated prompt without making an API call, which is useful for reviewing the prompt before sending it to the API. ## Basic Usage Example Here's a simple example using a single LLM model for annotation: ```{r} # Example marker data markers <- data.frame( cluster = c(0, 0, 0, 0, 0, 1, 1, 1, 1, 1), gene = c("CD3D", "CD3E", "CD2", "IL7R", "LTB", "CD14", "LYZ", "CST3", "MS4A7", "FCGR3A"), avg_log2FC = c(2.5, 2.3, 2.1, 1.8, 1.7, 3.1, 2.8, 2.5, 2.2, 2.0), p_val_adj = c(0.001, 0.001, 0.002, 0.003, 0.005, 0.0001, 0.0002, 0.0005, 0.001, 0.002) ) # Run annotation with a single model results <- annotate_cell_types( input = markers, tissue_name = "human PBMC", model = "claude-3-7-sonnet-20250219", api_key = Sys.getenv("ANTHROPIC_API_KEY"), top_gene_count = 10, debug = FALSE # Set to TRUE for more detailed output ) # Print results print(results) ``` ### Example Output When using a single model like Claude, the output will be a character vector with one annotation per cluster: ```r > print(results) [1] "0: T cells" "1: Monocytes" ``` ## Multi-Model Consensus Example For more reliable annotations, you can use multiple models and create a consensus: ```{r} # Define models to use models <- c( "claude-3-7-sonnet-20250219", # Anthropic "gpt-4o", # OpenAI "gemini-1.5-pro" # Google ) # API keys for different providers api_keys <- list( anthropic = Sys.getenv("ANTHROPIC_API_KEY"), openai = Sys.getenv("OPENAI_API_KEY"), gemini = Sys.getenv("GEMINI_API_KEY") ) # Run annotation with multiple models results <- list() for (model in models) { provider <- get_provider(model) api_key <- api_keys[[provider]] results[[model]] <- annotate_cell_types( input = markers, tissue_name = "human PBMC", model = model, api_key = api_key, top_gene_count = 10 ) } # Create consensus consensus_results <- interactive_consensus_annotation( input = markers, tissue_name = "human PBMC", models = models, # Use all the models defined above api_keys = api_keys, controversy_threshold = 0.7, entropy_threshold = 1.0, consensus_check_model = "claude-3-7-sonnet-20250219" ) # Print consensus results print_consensus_summary(consensus_results) ``` ### Consensus Output Example The consensus results contain more detailed information: ```r > print_consensus_summary(consensus_results) Consensus Summary: ----------------- Total clusters: 2 Controversial clusters: 0 Consensus achieved for all clusters Cluster 0: Final annotation: T cells Consensus proportion: 1.0 Entropy: 0.0 Model predictions: - claude-3-7-sonnet-20250219: T cells - gpt-4o: T cells - gemini-1.5-pro: T cells Cluster 1: Final annotation: Monocytes Consensus proportion: 1.0 Entropy: 0.0 Model predictions: - claude-3-7-sonnet-20250219: Monocytes - gpt-4o: Monocytes - gemini-1.5-pro: Monocytes ``` ## Integrating with Seurat To add the annotations to your Seurat object: ```{r} # Assuming you have a Seurat object named 'seurat_obj' and consensus results library(Seurat) # Add consensus annotations to Seurat object seurat_obj$cell_type_consensus <- plyr::mapvalues( x = as.character(Idents(seurat_obj)), from = as.character(0:(length(consensus_results$final_annotations)-1)), to = consensus_results$final_annotations ) # Extract consensus metrics from the consensus results # Note: These metrics are available in the consensus_results$initial_results$consensus_results consensus_metrics <- lapply(names(consensus_results$initial_results$consensus_results), function(cluster_id) { metrics <- consensus_results$initial_results$consensus_results[[cluster_id]] return(list( cluster = cluster_id, consensus_proportion = metrics$consensus_proportion, entropy = metrics$entropy )) }) # Convert to data frame for easier handling metrics_df <- do.call(rbind, lapply(consensus_metrics, data.frame)) # Add consensus proportion to Seurat object seurat_obj$consensus_proportion <- plyr::mapvalues( x = as.character(Idents(seurat_obj)), from = metrics_df$cluster, to = metrics_df$consensus_proportion ) # Add entropy to Seurat object seurat_obj$entropy <- plyr::mapvalues( x = as.character(Idents(seurat_obj)), from = metrics_df$cluster, to = metrics_df$entropy ) ``` ## Basic Visualization Here's a simple visualization of the results using Seurat: ```{r} # Plot UMAP with cell type annotations DimPlot(seurat_obj, group.by = "cell_type_consensus", label = TRUE, repel = TRUE) + ggtitle("Cell Type Annotations") + theme(plot.title = element_text(hjust = 0.5)) ``` ## Understanding the Output The output of `annotate_cell_types()` is a vector of cell type annotations, where each element corresponds to a cluster. The output of `interactive_consensus_annotation()` is a list containing: - `final_annotations`: Final consensus cell type annotations - `initial_results`: Initial predictions from each model - `controversial_clusters`: List of clusters that required discussion - `discussion_logs`: Detailed logs of the discussion process - `session_id`: Unique identifier for the annotation session ### Understanding Uncertainty Metrics When using consensus annotation, two key metrics help evaluate the reliability of annotations: - **Consensus Proportion**: Ranges from 0 to 1, indicating the proportion of models that agree on the final annotation. Higher values indicate stronger agreement. - **Entropy**: Measures the uncertainty in model predictions. Lower values indicate more certainty. An entropy of 0 means all models agree perfectly. Clusters with low consensus proportion or high entropy may require manual review. ## Using OpenRouter Free Models If you don't have access to paid API keys, you can use OpenRouter's free models: ```{r} # Set OpenRouter API key Sys.setenv(OPENROUTER_API_KEY = "your-openrouter-api-key") # Use a free model free_results <- annotate_cell_types( input = markers, tissue_name = "human PBMC", model = "meta-llama/llama-4-maverick:free", # Note the :free suffix api_key = Sys.getenv("OPENROUTER_API_KEY"), top_gene_count = 10 ) # Print results print(free_results) ``` Available free models include: - `meta-llama/llama-4-maverick:free` - Meta Llama 4 Maverick (256K context) - `nvidia/llama-3.1-nemotron-ultra-253b-v1:free` - NVIDIA Nemotron Ultra 253B - `deepseek/deepseek-chat-v3-0324:free` - DeepSeek Chat v3 - `microsoft/mai-ds-r1:free` - Microsoft MAI-DS-R1 Free models don't consume credits but may have limitations compared to paid models. ## Troubleshooting ### Common Issues 1. **API Key Not Found**: ```r Error: No auth credentials found ``` **Solution**: Ensure you've set the correct API key environment variable or provided it directly in the function call. 2. **Rate Limiting**: ```r Error: Rate limit exceeded ``` **Solution**: Wait a few minutes before trying again, or reduce the number of API calls by processing fewer clusters at once. 3. **Invalid Model Name**: ```r Error: Unsupported model: [model_name] ``` **Solution**: Check that you're using a supported model name and that it's spelled correctly. 4. **Network Issues**: ```r Error: Could not connect to API ``` **Solution**: Check your internet connection and try again. If the problem persists, the API service might be down. ## Next Steps Now that you understand the basics of mLLMCelltype, you can explore: - [Usage Tutorial](usage-tutorial.html): More detailed usage examples - [Visualization Guide](visualization-guide.html): Create publication-ready visualizations - [Introduction](introduction.html): Learn about mLLMCelltype principles and background If you encounter any issues, please [open an issue](https://github.com/cafferychen777/mLLMCelltype/issues) on our GitHub repository.