Recent advances in biochemistry and single-cell RNA sequencing (scRNA-seq) have allowed us to monitor the biological systems at the single-cell resolution. However, the low capture of mRNA material within individual cells often leads to inaccurate quantification of genetic material. Consequently, a significant amount of expression values are reported as missing, which are often referred to as dropouts. These missing information can heavily impact the accuracy of downstream analyses.
To addreess this problem, we introduce a novel inputation method, names single-cell Imputation via Subspace Regression (scISR). scISR performs imputation for single-cell sequencing data. scISR identifies the true dropout values in the scRNA-seq dataset using hyper-geomtric testing approach. Based on the result obtained from hyper-geometric testing, the original dataset is segregated into two including training data and imputable data. Next, training data is used for constructing a generalize linear regression model that is used for imputation on the imputable data.
scISR package can be installed from GitHub using
# Install devtools package utils::install.packages('devtools') # Install scISR from GitHub devtools::install_github('duct317/scISR')
Load the example, Goolam dataset. The data is a list consists of two elements: data matrix with rows as genes and columns as samples, and a vector with the cell types information.
#Load required library library(scISR) # Load example data (Goolam dataset with reduced number of genes), other dataset can be download from our server at http://scisr.tinnguyen-lab.com/ data('Goolam') # Raw data raw <- Goolam$data # Cell types information label <- Goolam$label
We can use the main funtion
scISR to impute the raw data.
# Generating subtyping result set.seed(1) imputed <- scISR(data = raw, ncores = 4)
The quality of the imputation can be evaluated using clustering analysis. We can evaluate the accuracy of clusters obtained from clustering analysis using raw and imputed data with adjusted Rand Index (ARI). Higher ARI means higher agreement between a given clustering and the ground truth. The clusters from imputed data have much higher accuracy.
library(irlba) library(mclust) # Perform PCA and k-means clustering on raw data set.seed(1) # Filter genes that have only zeros from raw data raw_filer <- raw[rowSums(raw != 0) > 0, ] pca_raw <- irlba::prcomp_irlba(t(raw_filer), n = 50)$x cluster_raw <- kmeans(pca_raw, length(unique(label)), nstart = 2000, iter.max = 2000)$cluster print(paste('ARI of clusters using raw data:', round(adjustedRandIndex(cluster_raw, label),3))) # Perform PCA and k-means clustering on imputed data set.seed(1) pca_imputed <- irlba::prcomp_irlba(t(imputed), n = 50)$x cluster_imputed <- kmeans(pca_imputed, length(unique(label)), nstart = 2000, iter.max = 2000)$cluster print(paste('ARI of clusters using imputed data:', round(adjustedRandIndex(cluster_imputed, label),3)))