---
title: "Real PubMed Search Analysis with searchAnalyzeR"
subtitle: "A Comprehensive Example Using COVID-19 Long-term Effects"
author: "searchAnalyzeR Development Team"
date: "`r Sys.Date()`"
output:
  rmarkdown::html_vignette:
    toc: true
    toc_depth: 3
    number_sections: true
vignette: >
  %\VignetteIndexEntry{Real PubMed Search Analysis with searchAnalyzeR}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 3,
  fig.height = 3,
  eval = TRUE
)
```

# Introduction

This vignette demonstrates how to use the `searchAnalyzeR` package to conduct a comprehensive analysis of systematic review search strategies using real PubMed data. We'll walk through a complete workflow for analyzing search performance, from executing searches to generating publication-ready reports.

The `searchAnalyzeR` package provides tools for:

- **Search execution and standardization** across multiple databases
- **Duplicate detection and removal** using sophisticated algorithms
- **Performance metric calculation** including precision, recall, and F1 scores
- **Visualization generation** for search strategy assessment
- **Export capabilities** in multiple formats (CSV, Excel, RIS)
- **PRISMA diagram creation** for systematic review reporting
- **Term effectiveness analysis** to optimize search strategies

## Case Study: COVID-19 Long-term Effects

For this demonstration, we'll analyze a search strategy designed to identify literature on the long-term effects of COVID-19, commonly known as "Long COVID." This topic represents a rapidly evolving area of research that presents typical challenges faced in systematic reviews.
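Before walking through the full workflow, it helps to see the core performance metrics computed by hand. The definitions below are the standard ones (precision, recall, F1, number needed to read); this is a toy, self-contained base-R sketch that does not use `searchAnalyzeR`, and the PMIDs in it are invented for illustration:

```{r metrics-by-hand}
# Toy example: metrics from two ID sets (hypothetical PMIDs)
retrieved <- c("PMID:1", "PMID:2", "PMID:3", "PMID:4", "PMID:5")  # what the search found
relevant  <- c("PMID:2", "PMID:4", "PMID:9")                      # the gold standard

tp <- length(intersect(retrieved, relevant))  # relevant articles actually retrieved: 2

precision <- tp / length(retrieved)                      # 2 / 5 = 0.4
recall    <- tp / length(relevant)                       # 2 / 3 ~ 0.667
f1        <- 2 * precision * recall / (precision + recall)  # 0.5
nnr       <- 1 / precision                               # screen 2.5 articles per hit
```

Later sections produce exactly these quantities, but derived from real search results and a gold standard built from the retrieved corpus.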
# Getting Started

## Required Packages

First, let's load the required packages for our analysis:

```{r load-packages, message=FALSE, warning=FALSE}
# Load required packages
library(searchAnalyzeR)
library(rentrez)    # For PubMed API access
library(xml2)       # For XML parsing
library(dplyr)
library(ggplot2)
library(lubridate)

cat("=== searchAnalyzeR: Real PubMed Search Example ===\n")
cat("Topic: Long-term effects of COVID-19 (Long COVID)\n")
cat("Objective: Demonstrate search strategy analysis with real data\n\n")
```

## Defining the Search Strategy

A well-defined search strategy is crucial for systematic reviews. Here we define our search parameters including terms, databases, date ranges, and filters:

```{r define-strategy}
# Define our search strategy
search_strategy <- list(
  terms = c(
    "long covid",
    "post-covid syndrome",
    "covid-19 sequelae",
    "post-acute covid-19",
    "persistent covid symptoms"
  ),
  databases = c("PubMed"),
  date_range = as.Date(c("2020-01-01", "2024-12-31")),
  filters = list(
    language = "English",
    article_types = c("Journal Article", "Review", "Clinical Trial")
  ),
  search_date = Sys.time()
)

cat("Search Strategy:\n")
cat("Terms:", paste(search_strategy$terms, collapse = " OR "), "\n")
cat("Date range:", paste(search_strategy$date_range, collapse = " to "), "\n\n")
```

The search strategy includes:

- **Search terms**: A comprehensive list of synonyms and related terms for Long COVID
- **Date range**: Covering the period from the start of the pandemic through 2024
- **Language filter**: English language publications only
- **Article types**: Focusing on primary research, reviews, and clinical trials

# Executing the Search

## PubMed Search

The `searchAnalyzeR` package provides convenient functions to search PubMed and retrieve article metadata:

```{r execute-search}
# Execute the search using the package function
cat("Searching PubMed for real articles...\n")

raw_results <- search_pubmed(
  search_terms = search_strategy$terms,
  max_results = 150,
  date_range =
    search_strategy$date_range,
  language = "English"
)

cat("\nRaw search completed. Retrieved", nrow(raw_results), "articles.\n")
```

## Data Standardization

Raw search results from different databases often have varying formats. The `std_search_results()` function standardizes the data structure:

```{r standardize-results}
# Standardize the results using searchAnalyzeR functions
cat("\nStandardizing search results...\n")
standardized_results <- std_search_results(raw_results, source_format = "pubmed")
```

This standardization ensures that:

- Field names are consistent across different data sources
- Date formats are properly parsed
- Missing values are handled appropriately
- Data types are correctly assigned

# Data Quality Assessment

## Duplicate Detection

Duplicate detection is critical in systematic reviews, especially when searching multiple databases. The package provides sophisticated algorithms for identifying duplicates:

```{r duplicate-detection}
# Detect and remove duplicates
cat("Detecting duplicates...\n")
dedup_results <- detect_dupes(standardized_results, method = "exact")

cat("Duplicate detection complete:\n")
cat("- Total articles:", nrow(dedup_results), "\n")
cat("- Unique articles:", sum(!dedup_results$duplicate), "\n")
cat("- Duplicates found:", sum(dedup_results$duplicate), "\n\n")
```

The `detect_dupes()` function offers several methods:

- **exact**: Matches identical titles and authors
- **fuzzy**: Uses string similarity algorithms
- **doi**: Matches based on Digital Object Identifiers
- **combined**: Uses multiple criteria for robust detection

## Search Statistics

Basic statistics help assess the overall quality of the search results:

```{r search-stats}
# Calculate search statistics
search_stats <- calc_search_stats(dedup_results)

cat("Search Statistics:\n")
cat("- Date range:", paste(search_stats$date_range, collapse = " to "), "\n")
cat("- Missing abstracts:", search_stats$missing_abstracts, "\n")
cat("- Missing dates:",
    search_stats$missing_dates, "\n\n")
```

# Creating a Gold Standard

## Demonstration Gold Standard

For performance evaluation, we need a "gold standard" of known relevant articles. In a real systematic review, this would be your manually identified relevant articles. For this demonstration, we create a simplified gold standard:

```{r gold-standard}
# Create a gold standard for demonstration
# In a real systematic review, this would be your known relevant articles
# For this example, we'll identify articles that contain key terms in titles
cat("Creating demonstration gold standard...\n")

long_covid_terms <- c("long covid", "post-covid", "post-acute covid",
                      "persistent covid", "covid sequelae")
pattern <- paste(long_covid_terms, collapse = "|")

gold_standard_ids <- dedup_results %>%
  filter(!duplicate) %>%
  filter(grepl(pattern, tolower(title))) %>%
  pull(id)

cat("Gold standard created with", length(gold_standard_ids),
    "highly relevant articles\n\n")
```

**Note**: In practice, your gold standard would be created through:

- Expert knowledge of key articles in the field
- Pilot searches and manual review
- Previously published systematic reviews
- Consultation with domain experts

# Performance Analysis

## Initializing the SearchAnalyzer

The `SearchAnalyzer` class provides comprehensive tools for evaluating search performance:

```{r initialize-analyzer}
# Initialize SearchAnalyzer with real data
cat("Initializing SearchAnalyzer...\n")

analyzer <- SearchAnalyzer$new(
  search_results = filter(dedup_results, !duplicate),
  gold_standard = gold_standard_ids,
  search_strategy = search_strategy
)
```

## Calculating Performance Metrics

The analyzer calculates a comprehensive set of performance metrics:

```{r calculate-metrics}
# Calculate comprehensive metrics
cat("Calculating performance metrics...\n")
metrics <- analyzer$calculate_metrics()

# Display key metrics
cat("\n=== SEARCH PERFORMANCE METRICS ===\n")
if (!is.null(metrics$precision_recall$precision)) {
  cat("Precision:",
      round(metrics$precision_recall$precision, 3), "\n")
  cat("Recall:", round(metrics$precision_recall$recall, 3), "\n")
  cat("F1 Score:", round(metrics$precision_recall$f1_score, 3), "\n")
  cat("Number Needed to Read:",
      round(metrics$precision_recall$number_needed_to_read, 1), "\n")
}

cat("\n=== BASIC METRICS ===\n")
cat("Total Records:", metrics$basic$total_records, "\n")
cat("Unique Records:", metrics$basic$unique_records, "\n")
cat("Duplicates:", metrics$basic$duplicates, "\n")
cat("Sources:", metrics$basic$sources, "\n")
```

### Understanding the Metrics

- **Precision**: Proportion of retrieved articles that are relevant
- **Recall**: Proportion of relevant articles that are retrieved
- **F1 Score**: Harmonic mean of precision and recall
- **Number Needed to Read**: Average number of articles to screen to find one relevant article

# Visualization and Reporting

## Performance Visualizations

The package generates publication-ready visualizations to assess search performance:

```{r visualizations, fig.width=8, fig.height=6}
# Generate visualizations
cat("\nGenerating visualizations...\n")

# Overview plot
overview_plot <- analyzer$visualize_performance("overview")
print(overview_plot)

# Temporal distribution plot
temporal_plot <- analyzer$visualize_performance("temporal")
print(temporal_plot)

# Precision-recall curve (if gold standard available)
if (length(gold_standard_ids) > 0) {
  pr_plot <- analyzer$visualize_performance("precision_recall")
  print(pr_plot)
}
```

## PRISMA Flow Diagram

The package can generate data for PRISMA flow diagrams, essential for systematic review reporting:

```{r prisma-setup}
# Generate PRISMA flow diagram data
cat("\nCreating PRISMA flow data...\n")

screening_data <- data.frame(
  id = dedup_results$id[!dedup_results$duplicate],
  identified = TRUE,
  duplicate = FALSE,
  title_abstract_screened = TRUE,
  full_text_eligible = runif(sum(!dedup_results$duplicate)) > 0.7,  # Simulate screening
  included = runif(sum(!dedup_results$duplicate)) > 0.85,           # Simulate final inclusion
  excluded_title_abstract = runif(sum(!dedup_results$duplicate)) > 0.3,
  excluded_full_text = runif(sum(!dedup_results$duplicate)) > 0.15
)
```

```{r prisma-diagram, fig.width=10, fig.height=8}
# Generate PRISMA diagram
reporter <- PRISMAReporter$new()
prisma_plot <- reporter$generate_prisma_diagram(screening_data)
print(prisma_plot)
```

# Data Export and Sharing

## Multiple Export Formats

The package supports exporting results in various formats commonly used in systematic reviews:

```{r export-results}
# Export results in multiple formats
cat("\nExporting results...\n")
output_dir <- tempdir()

export_files <- export_results(
  search_results = filter(dedup_results, !duplicate),
  file_path = file.path(output_dir, "covid_long_term_search"),
  formats = c("csv", "xlsx", "ris"),
  include_metadata = TRUE
)

cat("Files exported:\n")
for (file in export_files) {
  cat("-", file, "\n")
}
```

## Metrics Export

Performance metrics can also be exported for further analysis or reporting:

```{r export-metrics}
# Export metrics
metrics_file <- export_metrics(
  metrics = metrics,
  file_path = file.path(output_dir, "search_metrics.xlsx"),
  format = "xlsx"
)
cat("- Metrics exported to:", metrics_file, "\n")
```

## Comprehensive Data Package

For reproducibility, create a complete data package containing all analysis components:

```{r data-package}
# Create a complete data package
cat("\nCreating comprehensive data package...\n")

package_dir <- create_data_package(
  search_results = filter(dedup_results, !duplicate),
  analysis_results = list(
    metrics = metrics,
    search_strategy = search_strategy,
    screening_data = screening_data
  ),
  output_dir = output_dir,
  package_name = "covid_long_term_systematic_review"
)

cat("Data package created at:", package_dir, "\n")
```

# Advanced Analysis

## Benchmark Validation

The package includes tools for validating search strategies against established benchmarks:

```{r benchmark-validation}
# Demonstrate benchmark validation (simplified)
cat("\nDemonstrating benchmark validation...\n")

validator <- BenchmarkValidator$new()

# Add our search as a custom benchmark
validator$add_benchmark(
  name = "covid_long_term",
  corpus = filter(dedup_results, !duplicate),
  relevant_ids = gold_standard_ids
)

# Validate the strategy
validation_results <- validator$validate_strategy(
  search_strategy = search_strategy,
  benchmark_name = "covid_long_term"
)

cat("Validation Results:\n")
cat("- Precision:", round(validation_results$precision, 3), "\n")
cat("- Recall:", round(validation_results$recall, 3), "\n")
cat("- F1 Score:", round(validation_results$f1_score, 3), "\n")
```

## Text Similarity Analysis

Analyze how well the retrieved abstracts match the search terms:

```{r text-similarity}
# Text similarity analysis on abstracts
cat("\nAnalyzing abstract similarity to search terms...\n")

search_term_text <- paste(search_strategy$terms, collapse = " ")

similarity_scores <- sapply(
  dedup_results$abstract[!dedup_results$duplicate],
  function(abstract) {
    if (is.na(abstract) || abstract == "") return(0)
    calc_text_sim(search_term_text, abstract, method = "jaccard")
  }
)

cat("Average abstract similarity to search terms:",
    round(mean(similarity_scores, na.rm = TRUE), 3), "\n")
cat("Abstracts with high similarity (>0.1):",
    sum(similarity_scores > 0.1, na.rm = TRUE), "\n")
```

## Term Effectiveness Analysis

Evaluate which search terms are most effective:

```{r term-analysis}
# Analyze term effectiveness
cat("\nAnalyzing individual term effectiveness...\n")

term_analysis <- term_effectiveness(
  terms = search_strategy$terms,
  search_results = filter(dedup_results, !duplicate),
  gold_standard = gold_standard_ids
)
print(term_analysis)
```

```{r term-scores}
# Calculate term effectiveness scores
term_scores <- calc_tes(term_analysis)

cat("\nTerm Effectiveness Scores (TES):\n")
print(term_scores[order(term_scores$tes, decreasing = TRUE), ])
```

```{r top-terms, fig.width=8, fig.height=6}
# Find top performing terms
top_terms <-
  find_top_terms(term_analysis, n = 3, plot = TRUE,
                 plot_type = "precision_coverage")

cat("\nTop 3 performing terms:", paste(top_terms$terms, collapse = ", "), "\n")

if (!is.null(top_terms$plot)) {
  print(top_terms$plot)
}
```

# Results Interpretation and Recommendations

## Automated Recommendations

Based on the calculated metrics, the package can provide automated recommendations:

```{r recommendations}
# Final summary and recommendations
cat("\n=== FINAL SUMMARY AND RECOMMENDATIONS ===\n")
cat("Search Topic: Long-term effects of COVID-19\n")
cat("Articles Retrieved:", sum(!dedup_results$duplicate), "\n")
cat("Search Date Range:", paste(search_strategy$date_range, collapse = " to "), "\n")

if (!is.null(metrics$precision_recall$precision)) {
  cat("Search Precision:", round(metrics$precision_recall$precision, 3), "\n")

  if (metrics$precision_recall$precision < 0.1) {
    cat("RECOMMENDATION: Low precision suggests search may be too broad. Consider:\n")
    cat("- Adding more specific terms\n")
    cat("- Using MeSH terms\n")
    cat("- Adding study type filters\n")
  } else if (metrics$precision_recall$precision > 0.5) {
    cat("RECOMMENDATION: High precision suggests good specificity. Consider:\n")
    cat("- Broadening search if recall needs improvement\n")
    cat("- Adding synonyms or related terms\n")
  }
}
```

## Sample Retrieved Articles

Let's examine some of the retrieved articles to understand the search results:

```{r sample-articles}
# Show some example retrieved articles
cat("\n=== SAMPLE RETRIEVED ARTICLES ===\n")

sample_articles <- filter(dedup_results, !duplicate) %>%
  arrange(desc(date)) %>%
  head(3)

for (i in 1:nrow(sample_articles)) {
  article <- sample_articles[i, ]
  cat("\n", i, ". ", article$title, "\n", sep = "")
  cat(" Journal:", article$source, "\n")
  cat(" Date:", as.character(article$date), "\n")
  cat(" PMID:", gsub("PMID:", "", article$id), "\n")
  cat(" Abstract:", substr(article$abstract, 1, 200), "...\n")
}
```

# Summary and Next Steps

## What We've Accomplished

This vignette demonstrated a complete workflow using the `searchAnalyzeR` package:

```{r summary}
cat("\n=== ANALYSIS COMPLETE ===\n")
cat("This example demonstrated:\n")
cat("1. Real PubMed search execution using search_pubmed()\n")
cat("2. Data standardization and deduplication\n")
cat("3. Performance metric calculation\n")
cat("4. Visualization generation\n")
cat("5. Multi-format export capabilities\n")
cat("6. PRISMA diagram creation\n")
cat("7. Benchmark validation\n")
cat("8. Term effectiveness analysis\n")
cat("9. Comprehensive reporting\n")
cat("\nAll outputs saved to:", output_dir, "\n")
```

## Files Generated

The analysis generates numerous output files for different purposes:

```{r list-files}
# Clean up and provide final file locations
list.files(output_dir, pattern = "covid", full.names = TRUE, recursive = TRUE)
```

## Next Steps

After completing this analysis, typical next steps would include:

1. **Refining the search strategy** based on performance metrics and term effectiveness analysis
2. **Expanding to multiple databases** using similar workflows
3. **Implementing the refined strategy** in your systematic review protocol
4. **Using the exported data** for screening and data extraction phases
5. **Incorporating visualizations** into your systematic review protocol or publication

## Additional Resources

For more advanced features and customization options, consult:

- The package documentation: `help(package = "searchAnalyzeR")`
- Function-specific help: `?search_pubmed`, `?SearchAnalyzer`, etc.
- Additional vignettes covering specialized topics
- The package GitHub repository for updates and community contributions

This comprehensive workflow demonstrates how `searchAnalyzeR` can streamline and enhance the systematic review search process, providing objective metrics and visualizations to support evidence-based search strategy development and optimization.
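Because PubMed content changes over time, the exact counts and metrics in this vignette depend on when it is rendered. As a reproducibility aid, vignettes conventionally close with a record of the R and package versions used (a standard base-R call, not part of `searchAnalyzeR`):

```{r session-info}
# Record the R version and loaded packages used to render this vignette
sessionInfo()
```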