---
title: "Example Workflow for Single-Cell Annotation with easybio"
author: "[cw](https://cying.org)"
date: "`{r} Sys.Date()`"
output: html
vignette: >
  %\VignetteEngine{litedown::vignette}
  %\VignetteIndexEntry{Example Workflow for Single-Cell Annotation with easybio}
  %\VignetteEncoding{UTF-8}
---

## Introduction

This vignette demonstrates the powerful and intuitive workflow for single-cell RNA-seq annotation provided by the `easybio` package. The process is designed to combine the speed of automated database matching with the reliability of interactive verification and manual curation.

The core workflow follows three logical steps:

1. **Automated Annotation**: Use `matchCellMarker2()` to quickly get a list of potential cell types for each cluster based on its marker genes.
2. **Verification & Exploration**: Interactively investigate the automated results using `check_marker()` and `plotSeuratDot()` to build confidence in the annotations. This step helps answer two critical questions:
    * "**Why** was this annotation made?" (Which of my genes matched the database?)
    * "Is this annotation **correct**?" (Are the canonical markers for this cell type expressed in my cluster?)
3. **Final Curation**: Based on the evidence gathered, use `finsert()` to assign the final, high-confidence cell type labels.

You can also view the R script for this workflow by running: `fs::file_show(system.file(package = 'easybio', 'example-single-cell.R'))`

## Setup

First, let's load the necessary libraries and the example marker data included with `easybio`. This data is derived from the 10x Genomics 3k PBMC dataset.

```{r}
litedown::reactor(warning = FALSE) # vignette setting
library(easybio)
library(Seurat)
library(data.table)

# The pbmc.markers dataset is included in easybio
head(pbmc.markers)
```

## Step 1: Automated Annotation with `matchCellMarker2`

We begin by feeding the cluster markers (from `Seurat::FindAllMarkers`) into `matchCellMarker2()`.
This function compares our markers against the CellMarker2.0 database and returns a ranked list of potential cell types for each cluster.

```{r}
marker_matched <- matchCellMarker2(marker = pbmc.markers, n = 50, spc = "Human")

# Let's look at the top 2 potential cell types for each cluster
marker_matched[, head(.SD, 2), by = cluster]
```

The output table gives us `uniqueN` (the number of unique matching markers) and `N` (the total number of matches), which helps rank the potential annotations.

We can create a quick preliminary annotation by taking the top hit for each cluster.

```{r}
cl2cell_auto <- marker_matched[, head(.SD, 1), by = .(cluster)]
cl2cell_auto <- setNames(cl2cell_auto[["cell_name"]], cl2cell_auto[["cluster"]])
print("Initial automated annotation:")
cl2cell_auto
```

We can also get a global view of all possible annotations using `plotPossibleCell`.

```{r}
#| fig.width=10
plotPossibleCell(marker_matched[, head(.SD), by = .(cluster)], min.uniqueN = 2)
```

## Step 2: Verification and Exploration

This is the most critical step. Instead of blindly trusting the automated result, we use `easybio`'s tools to verify it.

### Answering "Why was this annotation made?"

To see the evidence behind an annotation, we use `check_marker()` with `cis = TRUE`. This shows us which of **our own marker genes** from our data matched the database for a given annotation.

```{r}
# Let's investigate clusters 1, 5, and 7
local_evidence <- check_marker(marker_matched, cl = c(1, 5, 7), topcellN = 2, cis = TRUE)
print(local_evidence)
```

### Answering "Is this annotation correct?"

To validate an annotation, we use `check_marker()` with `cis = FALSE` (the default). This fetches the **canonical markers** for the suggested cell type from the database. We can then check if these well-known markers are expressed in our cluster.
```{r}
canonical_markers <- check_marker(marker_matched, cl = c(1, 5, 7), topcellN = 2, cis = FALSE)
print(canonical_markers)
```

### Visual Confirmation with `plotSeuratDot`

The best way to check marker expression is visually. `plotSeuratDot` is designed to work seamlessly with `check_marker`. The entire pipeline from annotation to visualization can be done in a single, elegant pipe:

```{r, fig.width=9, fig.height=5}
# For this example to be runnable, we need a Seurat object.
# We'll create a minimal one. In your real workflow, you would use your own srt object.
set.seed(42) # make the synthetic counts and cluster labels reproducible
marker_genes <- unique(pbmc.markers$gene)
counts <- matrix(
  abs(rnorm(length(marker_genes) * 50, mean = 1, sd = 2)),
  nrow = length(marker_genes),
  ncol = 50
)
rownames(counts) <- marker_genes
colnames(counts) <- paste0("cell_", 1:50)
srt <- CreateSeuratObject(counts = counts)

# Assign clusters that match the pbmc.markers data
srt$seurat_clusters <- sample(0:8, 50, replace = TRUE)
Idents(srt) <- "seurat_clusters"

# Now, let's plot the evidence for clusters 1, 5, and 7
matchCellMarker2(marker = pbmc.markers, n = 50, spc = "Human") |>
  check_marker(cl = c(1, 5, 7), topcellN = 2, cis = TRUE) |>
  plotSeuratDot(srt = srt)
```

With a real Seurat object, this dot plot shows the expression of the genes that led to the annotations for clusters 1, 5, and 7, allowing us to assess the results with confidence. The synthetic object built here contains random values and only demonstrates the mechanics.

## Step 3: Final Manual Curation

After reviewing the evidence from the dot plots, we can make our final, informed decision. The `finsert` function provides a convenient way to create the final annotation vector.
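
As a mental model for the call that follows, the sketch below fills a character vector with one slot per cluster id using only base R. It illustrates the cluster-to-label mapping idea and is not `easybio`'s implementation; the names `groups`, `n_clusters`, and `labels` are hypothetical, and naming the slots by cluster id is an assumption made here for readability.

```r
# Hypothetical grouping: each label owns one or more cluster ids
groups <- list(
  "Monocyte" = c(1, 5),
  "B cell"   = 3,
  "DC"       = 7
)

# One slot per cluster id (0 to 8), initially unassigned
n_clusters <- 9L
labels <- rep(NA_character_, n_clusters)
names(labels) <- as.character(seq_len(n_clusters) - 1L)

# Write each label into the slots of the clusters it covers
for (cell in names(groups)) {
  labels[as.character(groups[[cell]])] <- cell
}

labels
```

Clusters left out of `groups` stay `NA`, which makes an incomplete curation easy to spot before the labels are used downstream.
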
```{r}
# Based on our exploration, we finalize the annotations
cl2cell_final <- finsert(
  list(
    c(3) ~ "B cell",
    c(8) ~ "Megakaryocyte",
    c(7) ~ "DC",
    c(1, 5) ~ "Monocyte",
    c(0, 2, 4) ~ "Naive CD8+ T cell",
    c(6) ~ "Natural killer cell"
  ),
  len = 9 # Ensure vector length covers all clusters (0-8)
)

print("Final curated annotation:")
cl2cell_final
```

This `cl2cell_final` vector can now be added to your Seurat object's metadata for downstream analysis and plotting.

## Using a Custom Marker Database

For specialized analyses, such as focusing on a specific tissue, working with a non-model organism, or using a proprietary list of markers, you can provide your own custom reference to `matchCellMarker2`.

The reference must be a `data.frame` (or `data.table`) with at least two columns: `cell_name` and `marker`. The easiest way to create this is from a named list.

**Step 1: Create a named list of your custom markers.**

```{r}
custom_ref_list <- list(
  "T-cell" = c("CD3D", "CD3E", "CD3G"),
  "B-cell" = c("CD79A", "MS4A1"),
  "Myeloid" = c("LYZ", "CST3", "AIF1")
)
print(custom_ref_list)
```

**Step 2: Convert the list to the required data.frame format.**

`easybio` provides the `list2dt` helper function for this.

```{r}
custom_ref_df <- list2dt(custom_ref_list, col_names = c("cell_name", "marker"))
head(custom_ref_df)
```

**Step 3: Run `matchCellMarker2` with the `ref` parameter.**

When `ref` is provided, the function ignores the `spc`, `tissueClass`, and `tissueType` parameters for matching.

```{r}
marker_custom <- matchCellMarker2(
  marker = pbmc.markers,
  n = 50,
  ref = custom_ref_df
)

# Note that the cell_name column now contains our custom cell types
marker_custom[, head(.SD, 2), by = cluster]
```

## Additional Utilities

`easybio` also provides functions for direct queries.

### `get_marker()`

Directly retrieve markers for any cell type of interest.
```{r}
get_marker(spc = "Human", cell = c("Monocyte", "Neutrophil"), number = 5, min.count = 1)
```

### `plotMarkerDistribution()`

Check the distribution of a specific marker across all cell types and tissues in the database.

```{r, fig.width=7.5, fig.height=7}
plotMarkerDistribution(mkr = "CD68")
```

```{js, echo=FALSE}
document.querySelectorAll('p img').forEach(img => {
  // Check whether this is a blank transparent image (can be matched exactly by src)
  if (
    img.src === ''
  ) {
    const parentP = img.closest('p');
    if (parentP) {
      parentP.remove();
    }
  }
});
```
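
## Attaching the Curated Labels to Cells

Step 3 notes that the curated vector can be added to the Seurat object's metadata. The sketch below shows the lookup itself in base R, without a Seurat object. It assumes the labels are held in a character vector named by cluster id; `cl2cell` and `cell_clusters` are hypothetical stand-ins, with the labels copied from the Step 3 call.

```r
# Hypothetical curated labels, named by cluster id ("0" to "8")
cl2cell <- c(
  "0" = "Naive CD8+ T cell", "1" = "Monocyte", "2" = "Naive CD8+ T cell",
  "3" = "B cell", "4" = "Naive CD8+ T cell", "5" = "Monocyte",
  "6" = "Natural killer cell", "7" = "DC", "8" = "Megakaryocyte"
)

# Hypothetical per-cell cluster assignments, stored as a factor
cell_clusters <- factor(c(0, 3, 1, 5, 8, 2), levels = 0:8)

# Look up by the cluster id as a string. Indexing with the factor itself
# would use its integer codes (1 to 9) and shift every label by one slot.
cell_type <- unname(cl2cell[as.character(cell_clusters)])

cell_type
```

With a real object, the same lookup would read along the lines of `srt$cell_type <- unname(cl2cell_final[as.character(Idents(srt))])`. This assumes `cl2cell_final` carries cluster ids as names, so inspect `names(cl2cell_final)` before relying on it.
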