--- title: "Usage Tutorial" author: "Chen Yang" date: "`r Sys.Date()`" output: html_document: toc: true toc_float: true toc_depth: 3 theme: flatly vignette: > %\VignetteIndexEntry{Usage Tutorial} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include=FALSE} knitr::opts_chunk$set( echo = TRUE, message = FALSE, warning = FALSE, eval = FALSE ) ``` # Usage Tutorial This tutorial provides detailed instructions for using mLLMCelltype for cell type annotation in single-cell RNA sequencing data. We'll cover various usage scenarios, parameter configurations, and integration with Seurat. ## Comprehensive Function Parameters ### annotate_cell_types() The main function for cell type annotation with a single model: ```{r} library(mLLMCelltype) results <- annotate_cell_types( input, # Marker gene data (data frame, list, or file path) tissue_name, # Tissue name (e.g., "human PBMC", "mouse brain") model, # LLM model to use api_key = NA, # API key (if not set in environment, NA returns prompt only) top_gene_count = 10, # Number of top genes per cluster to use debug = FALSE # Whether to print debugging information ) ``` ### interactive_consensus_annotation() Function for creating consensus annotations from multiple models through interactive discussion: ```{r eval=FALSE} consensus_results <- interactive_consensus_annotation( input, # Original marker gene data (Seurat FindAllMarkers result or list of genes) tissue_name = NULL, # Optional tissue name models = c("claude-3-7-sonnet-20250219", "gpt-4o", "gemini-1.5-pro"), # Models to use api_keys, # Named list of API keys top_gene_count = 10, # Number of top genes to use controversy_threshold = 0.7, # Threshold for identifying controversial clusters entropy_threshold = 1.0, # Entropy threshold for controversial clusters max_discussion_rounds = 3, # Maximum discussion rounds consensus_check_model = NULL, # Model to use for consensus checking (see recommendations below) log_dir = file.path(tempdir(), "mLLMCelltype_logs"), # Directory for logs (using tempdir) cache_dir = file.path(tempdir(), "mLLMCelltype_cache"), # Directory for cache (using tempdir) use_cache = TRUE # Whether to use cache ) ``` **Important Note on `consensus_check_model`**: This parameter is critical for annotation accuracy. The consensus check model evaluates semantic similarity, calculates consensus metrics, and moderates discussions. We strongly recommend using high-performance models like: - `claude-opus-4-20250514` (best overall) - `claude-3-7-sonnet-20250219` (excellent balance) - `o1` (advanced reasoning) - `gpt-4o` (strong performance) - `gemini-2.5-pro` (top-tier) See the main README for detailed recommendations and examples. ## Detailed Usage Scenarios ### Scenario 1: Basic Annotation with a Single Model For quick exploration or when API usage is a concern: ```{r} # Load example data library(Seurat) data("pbmc_small") # Find markers pbmc_markers <- FindAllMarkers(pbmc_small, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25) # Run annotation with a single model results <- annotate_cell_types( input = pbmc_markers, tissue_name = "human PBMC", model = "claude-3-7-sonnet-20250219", api_key = Sys.getenv("ANTHROPIC_API_KEY"), top_gene_count = 10 ) # Add annotations to Seurat object pbmc_small$cell_type_claude <- plyr::mapvalues( x = as.character(Idents(pbmc_small)), from = as.character(0:(length(results)-1)), to = results ) # Visualize DimPlot(pbmc_small, group.by = "cell_type_claude", label = TRUE) ``` ### Scenario 2: Multi-Model Consensus for High Accuracy For publication-quality annotations with uncertainty quantification: ```{r} # Define multiple models to use models <- c( "claude-3-7-sonnet-20250219", # Anthropic "gpt-4o", # OpenAI "gemini-1.5-pro", # Google "grok-3" # X.AI ) # API keys for different providers api_keys <- list( anthropic = Sys.getenv("ANTHROPIC_API_KEY"), openai = Sys.getenv("OPENAI_API_KEY"), gemini = Sys.getenv("GEMINI_API_KEY"), grok = Sys.getenv("GROK_API_KEY") ) # Run annotation with multiple models results <- list() for (model in models) { provider <- get_provider(model) api_key <- api_keys[[provider]] results[[model]] <- annotate_cell_types( input = pbmc_markers, tissue_name = "human PBMC", model = model, api_key = api_key, top_gene_count = 10 ) } # Create consensus consensus_results <- interactive_consensus_annotation( input = pbmc_markers, tissue_name = "human PBMC", models = models, # Use all the models defined above api_keys = api_keys, controversy_threshold = 0.7, entropy_threshold = 1.0, consensus_check_model = "claude-3-7-sonnet-20250219" ) # View consensus results # You can access the final annotations with consensus_results$final_annotations # Add consensus annotations and metrics to Seurat object pbmc_small$cell_type_consensus <- plyr::mapvalues( x = as.character(Idents(pbmc_small)), from = as.character(0:(length(consensus_results$final_annotations)-1)), to = consensus_results$final_annotations ) # Extract consensus metrics from the consensus results consensus_metrics <- lapply(names(consensus_results$initial_results$consensus_results), function(cluster_id) { metrics <- consensus_results$initial_results$consensus_results[[cluster_id]] return(list( cluster = cluster_id, consensus_proportion = metrics$consensus_proportion, entropy = metrics$entropy )) }) # Convert to data frame for easier handling metrics_df <- do.call(rbind, lapply(consensus_metrics, data.frame)) # Add consensus proportion to Seurat object pbmc_small$consensus_proportion <- plyr::mapvalues( x = as.character(Idents(pbmc_small)), from = metrics_df$cluster, to = metrics_df$consensus_proportion ) # Add entropy to Seurat object pbmc_small$shannon_entropy <- plyr::mapvalues( x = as.character(Idents(pbmc_small)), from = metrics_df$cluster, to = metrics_df$entropy ) ``` ### Scenario 2b: Using Free OpenRouter Models For users with limited API credits or budget constraints: ```{r} # Set OpenRouter API key openrouter_api_key <- Sys.getenv("OPENROUTER_API_KEY") # Define free OpenRouter models to use free_models <- c( "meta-llama/llama-4-maverick:free", # Meta Llama 4 Maverick (free) "nvidia/llama-3.1-nemotron-ultra-253b-v1:free", # NVIDIA Nemotron Ultra 253B (free) "deepseek/deepseek-chat-v3-0324:free", # DeepSeek Chat v3 (free) "microsoft/mai-ds-r1:free" # Microsoft MAI-DS-R1 (free) ) # Run annotation with free OpenRouter models free_results <- list() for (model in free_models) { free_results[[model]] <- annotate_cell_types( input = pbmc_markers, tissue_name = "human PBMC", model = model, # OpenRouter models are automatically detected by format: 'provider/model-name:free' api_key = openrouter_api_key, top_gene_count = 10 ) } # Create consensus with free models free_consensus_results <- interactive_consensus_annotation( input = pbmc_markers, tissue_name = "human PBMC", models = free_models, # Use all the free models defined above api_keys = list("openrouter" = openrouter_api_key), controversy_threshold = 0.7, entropy_threshold = 1.0, consensus_check_model = "meta-llama/llama-4-maverick:free" # Use a free model for consensus checking ) # View free model consensus results # You can access the final annotations with free_consensus_results$final_annotations # Add free model consensus annotations to Seurat object pbmc_small$free_model_consensus <- plyr::mapvalues( x = as.character(Idents(pbmc_small)), from = as.character(0:(length(free_consensus_results$final_annotations)-1)), to = free_consensus_results$final_annotations ) # Compare paid vs. free model results comparison <- data.frame( cluster = as.character(0:(length(consensus_results$final_annotations)-1)), paid_models = consensus_results$final_annotations, free_models = free_consensus_results$final_annotations, agreement = consensus_results$final_annotations == free_consensus_results$final_annotations ) print(comparison) ``` ### Scenario 3: Working with CSV Files For users who prefer working with files: ```{r} # Save markers to CSV write.csv(pbmc_markers, "pbmc_markers.csv", row.names = FALSE) # Run annotation using the CSV file results <- annotate_cell_types( input = "pbmc_markers.csv", tissue_name = "human PBMC", model = "claude-3-7-sonnet-20250219", api_key = Sys.getenv("ANTHROPIC_API_KEY") ) ``` ### Scenario 4: Custom Caching For better control over caching behavior: ```{r} # Note: The annotate_cell_types function does not have built-in caching. # If you need caching, you can implement it separately. # Run annotation results <- annotate_cell_types( input = pbmc_markers, tissue_name = "human PBMC", model = "claude-3-7-sonnet-20250219", api_key = Sys.getenv("ANTHROPIC_API_KEY"), top_gene_count = 10, debug = FALSE ) # If you need custom caching, you can implement it using your own cache manager # This is just a conceptual example and not part of the actual package # cache_manager <- YourCacheManager$new(cache_dir = file.path(tempdir(), "cache")) # cache_manager$clear_cache() ``` ## Model Selection Guide mLLMCelltype supports a wide range of LLM models. Here's a guide to help you choose: ### High Performance Models For the most accurate annotations: - **Anthropic Claude 3.7 Sonnet** (`claude-3-7-sonnet-20250219`): Excellent biological knowledge, best for discussion - **OpenAI GPT-4o** (`gpt-4o`): Strong overall performance, good biological knowledge - **Google Gemini 1.5 Pro** (`gemini-1.5-pro`): Good performance with detailed reasoning ### Balanced Performance/Cost Models For good results with lower API costs: - **Anthropic Claude 3.5 Sonnet** (`claude-3-5-sonnet-20240620`): Good balance of performance and cost - **X.AI Grok-3** (`grok-3`): Competitive performance at lower cost - **DeepSeek V3** (`deepseek-v3`): Good performance for specialized tissues ### Economy Models For preliminary exploration or large datasets: - **Qwen 2.5** (`qwen-max-2025-01-25`): Good performance for the cost - **Zhipu GLM-4** (`glm-4`): Economical option with decent performance - **MiniMax** (`minimax`): Cost-effective for initial exploration ### Free Models via OpenRouter For users with limited API credits or budget constraints: - **Meta Llama 4 Maverick** (`meta-llama/llama-4-maverick:free`): Most reliable and fast, recommended for consensus checking - **NVIDIA Nemotron Ultra 253B** (`nvidia/llama-3.1-nemotron-ultra-253b-v1:free`): Good performance with consistent formatting - **Microsoft MAI-DS-R1** (`microsoft/mai-ds-r1:free`): Reliable with good response time - **DeepSeek Chat v3** (`deepseek/deepseek-chat-v3-0324:free`): Free model with 163K context window (may occasionally return empty results) Based on our testing, we recommend the following free models: - `meta-llama/llama-4-maverick:free`: Most reliable and fast, recommended for consensus checking - `nvidia/llama-3.1-nemotron-ultra-253b-v1:free`: Good performance with consistent formatting - `microsoft/mai-ds-r1:free`: Reliable with good response time Some models may have limitations: - `deepseek/deepseek-chat-v3-0324:free`: May occasionally return empty results - `thudm/glm-z1-9b:free`: May return Chinese error messages ("非字符参数") when used for consensus checking These free models are accessed through OpenRouter and don't consume credits, but may have limitations compared to paid models. Use the `:free` suffix in the model name to access them. ```{r} # Example of using a free model via OpenRouter # First, set your OpenRouter API key Sys.setenv(OPENROUTER_API_KEY = "your-openrouter-api-key") # Then use a free model with the :free suffix free_model_results <- annotate_cell_types( input = pbmc_markers, tissue_name = "human PBMC", model = "meta-llama/llama-4-maverick:free", # Note the :free suffix api_key = Sys.getenv("OPENROUTER_API_KEY") # No need to specify provider - it's automatically detected from the model name format ) ``` ## Integration with Seurat Workflow Here's a complete example of integrating mLLMCelltype with a Seurat workflow: ```{r} library(Seurat) library(mLLMCelltype) library(ggplot2) # Load data data("pbmc_small") # Standard Seurat preprocessing pbmc_small <- NormalizeData(pbmc_small) pbmc_small <- FindVariableFeatures(pbmc_small) pbmc_small <- ScaleData(pbmc_small) pbmc_small <- RunPCA(pbmc_small) pbmc_small <- FindNeighbors(pbmc_small) pbmc_small <- FindClusters(pbmc_small, resolution = 0.5) pbmc_small <- RunUMAP(pbmc_small, dims = 1:10) # Find markers for each cluster pbmc_markers <- FindAllMarkers(pbmc_small, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25) # Define models to use models <- c( "claude-3-7-sonnet-20250219", "gpt-4o", "gemini-1.5-pro" ) # API keys api_keys <- list( anthropic = Sys.getenv("ANTHROPIC_API_KEY"), openai = Sys.getenv("OPENAI_API_KEY"), gemini = Sys.getenv("GEMINI_API_KEY") ) # Run annotation with multiple models results <- list() for (model in models) { provider <- get_provider(model) api_key <- api_keys[[provider]] results[[model]] <- annotate_cell_types( input = pbmc_markers, tissue_name = "human PBMC", model = model, api_key = api_key, top_gene_count = 10 ) # Add individual model results to Seurat object column_name <- paste0("cell_type_", gsub("[^a-zA-Z0-9]", "_", model)) pbmc_small[[column_name]] <- plyr::mapvalues( x = as.character(Idents(pbmc_small)), from = as.character(0:(length(results[[model]])-1)), to = results[[model]] ) } # Create consensus consensus_results <- interactive_consensus_annotation( input = pbmc_markers, tissue_name = "human PBMC", models = models, # Use all the models defined above api_keys = api_keys, controversy_threshold = 0.7, entropy_threshold = 1.0, consensus_check_model = "claude-3-7-sonnet-20250219" ) # Add consensus results to Seurat object pbmc_small$cell_type_consensus <- plyr::mapvalues( x = as.character(Idents(pbmc_small)), from = as.character(0:(length(consensus_results$final_annotations)-1)), to = consensus_results$final_annotations ) # Extract consensus metrics from the consensus results consensus_metrics <- lapply(names(consensus_results$initial_results$consensus_results), function(cluster_id) { metrics <- consensus_results$initial_results$consensus_results[[cluster_id]] return(list( cluster = cluster_id, consensus_proportion = metrics$consensus_proportion, entropy = metrics$entropy )) }) # Convert to data frame for easier handling metrics_df <- do.call(rbind, lapply(consensus_metrics, data.frame)) # Add consensus proportion to Seurat object pbmc_small$consensus_proportion <- as.numeric(plyr::mapvalues( x = as.character(Idents(pbmc_small)), from = metrics_df$cluster, to = metrics_df$consensus_proportion )) # Add entropy to Seurat object pbmc_small$shannon_entropy <- as.numeric(plyr::mapvalues( x = as.character(Idents(pbmc_small)), from = metrics_df$cluster, to = metrics_df$entropy )) # Visualize results p1 <- DimPlot(pbmc_small, group.by = "cell_type_consensus", label = TRUE, repel = TRUE) + ggtitle("Cell Type Annotations") + theme(plot.title = element_text(hjust = 0.5)) p2 <- FeaturePlot(pbmc_small, features = "consensus_proportion", cols = c("yellow", "green", "blue")) + ggtitle("Consensus Proportion") + theme(plot.title = element_text(hjust = 0.5)) p3 <- FeaturePlot(pbmc_small, features = "shannon_entropy", cols = c("red", "orange")) + ggtitle("Shannon Entropy") + theme(plot.title = element_text(hjust = 0.5)) # Combine plots p1 | p2 | p3 ``` ## Advanced Parameter Tuning ### Adjusting top_gene_count The `top_gene_count` parameter controls how many top marker genes per cluster are used for annotation: ```{r} # Using more genes (better for well-characterized tissues) results_more_genes <- annotate_cell_types( input = pbmc_markers, tissue_name = "human PBMC", model = "claude-3-7-sonnet-20250219", api_key = Sys.getenv("ANTHROPIC_API_KEY"), top_gene_count = 20 # Using more genes ) # Using fewer genes (better for noisy data) results_fewer_genes <- annotate_cell_types( input = pbmc_markers, tissue_name = "human PBMC", model = "claude-3-7-sonnet-20250219", api_key = Sys.getenv("ANTHROPIC_API_KEY"), top_gene_count = 5 # Using fewer genes ) ``` ### Adjusting controversy_threshold The `controversy_threshold` parameter in the `interactive_consensus_annotation` function controls which clusters are considered controversial and require discussion: ```{r eval=FALSE} # Example of using interactive_consensus_annotation with different controversy thresholds # Lower threshold (more clusters will be discussed) consensus_results_low_threshold <- interactive_consensus_annotation( input = pbmc_markers, tissue_name = "human PBMC", models = c("claude-3-7-sonnet-20250219", "gpt-4o", "gemini-2.0-flash"), api_keys = list( "anthropic" = Sys.getenv("ANTHROPIC_API_KEY"), "openai" = Sys.getenv("OPENAI_API_KEY"), "gemini" = Sys.getenv("GEMINI_API_KEY") ), controversy_threshold = 0.3 # Lower threshold - more clusters will be discussed ) # Higher threshold (fewer clusters will be discussed) consensus_results_high_threshold <- interactive_consensus_annotation( input = pbmc_markers, tissue_name = "human PBMC", models = c("claude-3-7-sonnet-20250219", "gpt-4o", "gemini-2.0-flash"), api_keys = list( "anthropic" = Sys.getenv("ANTHROPIC_API_KEY"), "openai" = Sys.getenv("OPENAI_API_KEY"), "gemini" = Sys.getenv("GEMINI_API_KEY") ), controversy_threshold = 0.7 # Higher threshold - fewer clusters will be discussed ) ``` ## Performance Considerations ### API Rate Limits and Costs Different LLM providers have different rate limits and pricing: - **OpenAI**: Higher rate limits but can be more expensive - **Anthropic**: Good balance of rate limits and cost - **Google**: Competitive pricing with good rate limits - **Others**: Generally lower cost but may have stricter rate limits To manage costs and rate limits: 1. Use caching to avoid redundant API calls 2. Start with a single model for exploration 3. Use more economical models for initial testing 4. Reserve multi-model consensus for final analysis ### Execution Time Typical execution times: - Single model annotation: 5-30 seconds per cluster - Multi-model consensus: 1-5 minutes for a typical dataset - Discussion process: Additional 1-3 minutes per controversial cluster To improve performance: 1. Use a smaller `top_gene_count` for faster execution 2. Enable caching to reuse results 3. Use a higher `controversy_threshold` to reduce the number of clusters that require discussion ## Troubleshooting ### Common Issues with OpenRouter If you encounter "No auth credentials found" errors with OpenRouter: - Verify your API key is correct and active - Ensure you're using the correct format for the model name (e.g., `provider/model-name:free`) - Try a different free model, as some models may have access restrictions ### Chinese Error Messages If you see "非字符参数" (non-character parameter) errors: - This typically occurs when using Chinese models like `thudm/glm-z1-9b:free` for consensus checking - Switch to an English model like `meta-llama/llama-4-maverick:free` for consensus checking ### Empty Results If a model returns empty results: - Try increasing the timeout or retry with the same model - Some models like `deepseek/deepseek-chat-v3-0324:free` may occasionally return empty results - Switch to a more reliable model if the problem persists ## Next Steps Now that you understand the detailed usage of mLLMCelltype, you can explore: - [Introduction](introduction.html): Learn about the technical principles and background - [Visualization Guide](visualization-guide.html): Create publication-ready visualizations - [Getting Started Guide](getting-started.html): Basic usage examples