A new Poisson probability paradigm for single cell RNA-seq clustering

Yue Pan

2022-08-15

Introduction

This package is developed to visualize the Poissoneity of scRNA-seq UMI data, and explore cell clustering based on model departure as a novel data representation.

In the following sections, we introduce the approach for assessment of the validity of the independent Poisson statistical framework using Q-Q envelope plots. The code here is specific for validation of independent Poisson distributions of scRNA-seq data matrix entries, but such idea can be applied to different types of data under different assumptions of distributions.

Then we propose using model departure as a novel data representation, which is a measurement of relative location of that UMI count with respect to the independent Poisson distribution at the individual entry level. A departure-based cell clustering algorithm is developed to explore cell subpopulations.

Installing

library(scpoisson)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)

Poissoneity of scRNA-seq data

ScRNA-seq data contains a large number of zeros, which makes the simple multiplication ineffective because zeros are not appropriately scaled by multiplication, motivating new statistical approaches. Here, we propose a different, more principled approach. Instead of focusing on gene averages and working with normalized and scaled counts data, we prioritize the individual UMI matrix entries. This approach is based on an independent Poisson statistical framework, where each RNA measurements for each cell comes from its own Poisson distribution.

In the following example, we show how to visualize the Poissoneity of scRNA-seq UMI data by Q-Q envelope plots. The envelopes are generated from theoretical distribution, which provide a quantitative way measuring deviation from the diagonal line. If the theoretical distribution fits the data well, the curve should be located around the diagonal line and within the envelope.

The GLM-PCA algorithm is applied for parameter estimation. Then some matrix entries (default 200, with estimated Poisson parameters closest to the given Poisson parameter) will be selected as sample data and compare with the theoretical Poisson distribution. The sample data we provided in this package contains 75 cells from one clonal cell line. Here we take \(L=10\) since that is shown to be an appropriate number of latent vectors for this relatively homogeneous data set. In practice, we suggest try a range of \(L\) and find the one which gives best fit. This number can be different for different data sets. As long as there exists such an \(L\), which gives reasonable fit under the Poisson statistical framework, that demonstrates the “Poissoneity” of such data set.

The similar idea can be used to visualize different types of data under different assumptions of distributions.

# UMI count matrix
test_dat <- as.data.frame(get_example_data("p5"))
scppp_obj <- scppp(test_dat, "rows")

# Q-Q envelope plot
qqplot_env_pois(scppp_obj, L = 10, lambda = 5) 

As shown the the above figure, the aggregated 200 matrix entries were well fitted by Poisson distribution with parameter 5, i.e. these entries can be regarded as independent random samples generated from Poisson distribution with mean 5.

Cell clustering based on model departure

A major application of this concept “Poissoneity” is clustering using a model departure as data representation.

The initial step is based on a crude two-way parameter approximation, where variation across cells is modeled by cell level parameters, and variation across genes is modeled by gene level parameters. This initial step in itself does not appropriately account for cell heterogeneity (different cell types). In the next step such interesting structure is captured by departures from the naive two-way approximation and the original count matrix is replaced by a Poisson departure matrix. In the departure matrix, each entry is quantified by the relative location of that original count with respect to the tentative Poisson distribution, whose parameter comes from the initial two-way approximation. The departure measure is captured by a Cumulative Distribution Function (CDF), which leaves the unexpectedly small counts nearly 0 and unusually large counts close to 1. Next, the departure measure is put on a more statistically amenable scale using the logit function. As a result, unexpectedly large counts give large positive values and unexpectedly small counts give large negative values. This departure matrix forms the input for cell clustering as a downstream analysis.

The following example shows how to run our clustering pipeline using a model departure as data representation. In this simple case, set number of simulations \(sim = 100\) is enough to separate the two known cell lines (71 cells from BCBL cell line and 58 from JSC cell line). For more complicated data sets, we suggest set the number of simulations to 500 or 1000 for more reliable clustering results.

The clustering results are stored in object scppp under “clust_results”. You can refer to “Hclust” under that for more details.

# UMI count matrix
test_dat <- as.data.frame(get_example_data("p56"))
scppp_obj <- scppp(test_dat, "rows")

scppp_obj <- HclustDepart(scppp_obj, maxSplit = 3)

# cluster label for each cell
clust_res <- scppp_obj[["clust_results"]]$Hclust[[1]]
head(clust_res)
##          names cluster
## 1 Plate05A_A02       1
## 2 Plate05A_A03       1
## 3 Plate05A_A04       1
## 4 Plate05A_A05       1
## 5 Plate05A_A06       1
## 6 Plate05A_A08       1
table(clust_res[,2])
## 
##  1  2 
## 71 58

The model departure data representation can also be fitted into any other clustering pipeline. Another option contained in this package is: keep departure as data representation but apply the Louvain algorithm (implemented in Seurat pipeline) for clustering. In general, this is a faster clustering algorithm to do clustering. However, it requires carefully tuning about resolution parameter, which can lead to different clustering results for the same data set. In this simple case, keeping the default resolution parameter (0.8) gives the correct clustering results as known cell lines.

The clustering results are also stored in object scppp under “clust_results”. You can refer to “Lclust” under that for more details.

scppp_obj <- LouvainDepart(scppp_obj)
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 129
## Number of edges: 3799
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.5998
## Number of communities: 2
## Elapsed time: 0 seconds
# cluster label for each cell
clust_res2 <- scppp_obj[["clust_results"]]$Lclust[[4]]
head(clust_res2)
##                     names cluster
## Plate05A_A02 Plate05A_A02       0
## Plate05A_A03 Plate05A_A03       0
## Plate05A_A04 Plate05A_A04       0
## Plate05A_A05 Plate05A_A05       0
## Plate05A_A06 Plate05A_A06       0
## Plate05A_A08 Plate05A_A08       0
table(clust_res2[,2])
## 
##  0  1 
## 71 58

References

Townes, F. W., Hicks, S. C., Aryee, M. J., & Irizarry, R. A. (2019). Feature selection and dimension reduction for single-cell RNA-Seq based on a multinomial model. Genome biology, 20(1), 1-16.

Stuart T, Butler A, Hoffman P, Hafemeister C, Papalexi E, III WMM, Hao Y, Stoeckius M, Smibert P, Satija R (2019). “Comprehensive Integration of Single-Cell Data.” Cell, 177, 1888-1902. doi: 10.1016/j.cell.2019.05.031, https://doi.org/10.1016/j.cell.2019.05.031.