ClussCluster

The ClussCluster package implements the sparse clustering algorithm described in “Simultaneous detection of cell types and type-specific signature genes in single-cell transcriptome data”. This algorithm performs clustering, detects distinct cell types, and identifies signature genes which are markers of these cell types. This vignette will guide you through the steps necessary to apply the algorithm to your data. A real scRNA-seq data set will be used as an illustration.

1. Load the package

First, we need to install and load the ClussCluster package. For each of the following sections in this vignette, we assume this step has been carried out.

#install.packages('ClussCluster')
library(ClussCluster)

2. Example Data set

For this vignette we will use the scRNA-seq data created by Hou et.al (2016). This data set is also used in the paper. It contains normalized and log-transformed gene expression, log(FPKM+1), measurements for 33 cells and 27,033 genes. 25 single cancer cells are derived from a human hepatocellular carcinoma (HCC) tissue and the other 8 are single human hepatoblastoma-derived (HepG2) cells. Therefore, the 33 cells belong to three populations: HCC-sub I, HCC-sub II, and HepG2. Since those cells were derived from different tissues and have different biological functions, the subset of genes that separate HCC cells from HepG2 cells are unlikely to be the same as those that separate the two HCC subpopulations. We are going to apply the ClussCluster algorithm and identify the three subpopulations and their signature genes, respectively.

Since the Hou data set is large, the viewed result is using a truncated data of Hou, called “Hou_sim”, which has only 100 genes.We first view the top of the data set to ensure it is in the correct format. Here the genes are rows and the columns are cells as desired.

data(Hou_sim)
hou.dat <- Hou_sim$x
dim(hou.dat)

## [1] 100  33

hou.dat[1:10, 1:5]

##          Ca_01_RNA Ca_02_RNA Ca_03_RNA Ca_04_RNA Ca_05_RNA
## MIR6723  11.504714 11.565489 11.539363 11.805804 11.130832
## FTL       8.777141  9.504263  9.014779  8.447843  8.352809
## APOA2     9.576607  9.648479  9.472589  9.182871  9.688982
## AMBP      9.235179  9.140271  9.189189  9.152703  8.974000
## SERPINA1  8.558757  9.012121  8.978799  8.915654  8.684773
## RBP4      8.850374  8.343378  8.438167  8.892830  9.276493
## RPLP0     7.128400  7.213989  6.870081  6.612286  7.677984
## EEF1A1    6.765042  7.263995  7.224965  7.238353  7.177950
## UBB       7.578243  8.242412  7.852855  7.755377  7.460047
## RPL7A     6.788272  7.157401  7.140334  6.760244  6.616789

Here Hou is a list containing many slots. The slot ‘x’ is the expression data set that is normalized and log-transformed but not centered. The slot ‘groups’ stores the names of the subpopulations and the slot ‘y’ stores the cluster labels of all cells. In practice, the labels are unlikely to be available and will be estimated by ClussCluster. In Section 5 of this vignette, the labels will be used as a reference to examine the clustering results of ClussCluster.

Hou_sim$groups

## [1] "HCC_subpop_1" "HCC_subpop_2" "HepG2"

table(Hou_sim$y)

## 
##  1  2  3 
## 18  7  8

The gene and cell names are also stored:

Hou_sim$gnames[1:10]

##  [1] "MIR6723"  "FTL"      "APOA2"    "AMBP"     "SERPINA1" "RBP4"     "RPLP0"   
##  [8] "EEF1A1"   "UBB"      "RPL7A"

Hou_sim$snames

##  [1] "Ca_01_RNA"          "Ca_02_RNA"          "Ca_03_RNA"          "Ca_04_RNA"         
##  [5] "Ca_05_RNA"          "Ca_06_RNA"          "Ca_07_RNA"          "Ca_08_RNA"         
##  [9] "Ca_09_RNA"          "Ca_10_RNA"          "Ca_11_RNA"          "Ca_12_RNA"         
## [13] "Ca_13_RNA"          "Ca_14_RNA"          "Ca_15_RNA"          "Ca_16_RNA"         
## [17] "Ca_17_RNA"          "Ca_18_RNA"          "Ca_19_RNA"          "Ca_20_RNA"         
## [21] "Ca_21_RNA"          "Ca_22_RNA"          "Ca_23_RNA"          "Ca_24_RNA"         
## [25] "Ca_25_RNA"          "scTrio_HepG2_1_RNA" "scTrio_HepG2_2_RNA" "scTrio_HepG2_3_RNA"
## [29] "scTrio_HepG2_4_RNA" "scTrio_HepG2_5_RNA" "scTrio_HepG2_6_RNA" "scRNA-seq_HepG2_1" 
## [33] "scRNA-seq_HepG2_2"

3. Pre-processing the data

Before beginning an analysis using ClussCluster, you will need to carry out a few preprocessing steps. This includes normalization, log transformation, filtering of genes that are mostly zero, and getting the data into the format that is expected by the ClussCluster function.

ClussCluster requires that data be normalized for sequencing depth prior to running ClussCluster. It is recommended that users make use of one of the many methods that exist for normalizing scRNA-Seq data. We also recommend that the data is log-transformed.

With real data, it is advisable to filter the input gene set to remove genes that have an extremely high proportion of zeroes. That means, genes where only one or fewer cells have a nonzero measurement should be filtered out. Because of this, we do not recommend that the data be centered prior to the filtering. Besides zero counts, genes with very low expression mean and variance should also be filtered out. In this section, we demonstrate the utility of the filter_gene function, which can be helpful if working with the log-transformed and normalized expression data.

Three thresholds can be varied:

n0prop: minimum proportion of zeroes. For example, n0prop=0.2 means that genes are filtered out if they are less than 20 percent zero;
minmean: minimum mean of expression. For example, minmean=2 means that genes are filtered out if their mean expression is less than 2;
minsd: minimum standard deviation of expression. For example, minsd=1.5 means that genes are filtered out if the standard deviation of expression is less than 1.5.

First, load the Hou example data set:

data(Hou_sim)
hou.dat <- Hou_sim$x

Now, apply the filter_gene function, use a threshold of 0.2 on the proportion of zeroes, 1.0 on the mean of expression, and 1.5 on the standard deviation of expression:

run.ft <- filter_gene(hou.dat, minmean=1.0, n0prop=0.2, minsd=1.5)

## 100out of100genes have mean expression >=1TRUE

## 100out of100genes have proportion of non-zero expression >=0.2TRUE

## 16out of100genes have standard deviation of expression >=1.5TRUE

## Overall,16out of100genes are kept.TRUE

dat.ft <- run.ft$dat.ft

dim(dat.ft)

## [1] 16 33

As we can see, this combination of filters removes 94 genes and keeps 16 genes.

Here run.ft is a list that contains several slots:

dfname: the original data set before filtering;
dat.ft: the data set after filtering;
index: the index of genes that are kept;
minmean, n0prop, minsd: thresholds used in the filtering.

Here we view the top of the filtered data:

dat.ft[1:10, 1:5]

##          Ca_01_RNA Ca_02_RNA Ca_03_RNA Ca_04_RNA Ca_05_RNA
## EEF1A1    6.765042  7.263995  7.224965  7.238353  7.177950
## APOC3     8.379980  8.454183  8.533255  8.310009  8.674804
## APOH      9.071375  8.342328  8.581050  7.980407  8.447236
## TF        8.447174  8.373840  8.533098  8.066634  8.466870
## APOA1     8.805525  7.824874  7.810661  8.435543  9.272113
## AHSG      7.673000  7.182702  7.274632  6.065108  7.948187
## GNB2L1    6.446358  6.422880  5.787146  5.851927  6.230214
## VTN       8.596627  8.629278  8.694035  8.253116  8.293972
## HP        7.748529  7.419285  7.682897  8.187730  7.335614
## SERPINA3  6.797823  6.795819  6.663207  7.074701  6.143505

and check if the filtering was succesful by examing the zero proportion, mean, standard deviation of each gene:

summary(apply(dat.ft!=0, 1, mean))

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9394  0.9924  1.0000  0.9905  1.0000  1.0000

summary(apply(dat.ft, 1, mean))

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   5.957   6.183   6.423   6.517   6.729   7.529

summary(apply(dat.ft, 1, sd))

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.502   1.665   1.775   1.987   2.287   3.026

4. Determine the tuning parameter

The tuning parameter s is pivotal to the algorithm’s performance and needs to be decided beforehand. It controls the level of sparsity of the marker genes for each cluster and the values in the resulting weight matrix. In the literature, a permutation approach that is closely related to the gap statistic is recommended.

We demonstrate the utility of the ClussCluster_Gap function to generate permutation samples and calculate the gap statistic. A set of candidate tuning parameter needs to be supplied beforehand and the one that maximizes the gap statistic will be selected. See more details in the paper.

To determine the value of s the following needs to be provided:

dat.ft: the expression data, regardless it is filtered or raw data set;
nclust or centers: the number of clusters or the centers of clusters. For the Hou data there are three subpopulations present in the data set and we put nclust = 3;
ws: a set of candidate tuning parameters. Here we choose the set of initial parameters to be 2.4, 3.1, and 3.8.

The default number of permuation sample is B = 20. To save time in this example, we set B = 5. Now, apply the ClussCluster_Gap function on the filtered data set dat.ft:

run.gap <- ClussCluster_Gap(dat.ft, B=5, nclust = 3, ws = c(2.4, 3.1, 3.8))

## s = 2.4

## s = 3.1

## s = 3.8

## permutation sample 1

## s = 2.4

## s = 3.1

## s = 3.8

## permutation sample 2

## s = 2.4

## s = 3.1

## s = 3.8

## permutation sample 3

## s = 2.4

## s = 3.1

## s = 3.8

## permutation sample 4

## s = 2.4

## s = 3.1

## s = 3.8

## permutation sample 5

## s = 2.4

## s = 3.1

## s = 3.8

We can view the results of the gap statistics by applying the print_ClussCluster_Gap function. The number of clusters, average number of genes with non-zero weights across the clusters, gap statistics and their standard deviations are listed:

print_ClussCluster_Gap(run.gap)

## Tuning parameter selection results for ClussCluster:
##   Wbound # Clusters # Mean Non-Zero W's Gap Statistic Standard Deviation
## 1    2.4          3               12.33        0.3679             0.0970
## 2    3.1          3               15.33        0.3542             0.0446
## 3    3.8          3               15.33        0.3581             0.0446
## Tuning parameter that leads to largest Gap statistic:  2.4
## Tuning parameter within one sd of the largest Gap statistic:  2.4

As the tuning parameter increases, the average number of signature genes and the value of the gap statistic tend to increase. The one that maximizes the gap statistic is. Alternatively, one can choose a tuning parameter to be the smallest value that is within one standard deviation of the maximal gap statistic. This results in fewer genes with positive weights. In practice, one may use this one-standard-deviation rule to select a smaller number of genes for more interpretable results.

Once the tuning parameter has been decided, we can go ahead and run the ClussCluster algorithm. For this vignette, we will use the one selected by the one-sd rule:

s <- run.gap$onesd.bestw
s

## [1] 2.4

5. Run ClussCluster

The ClussCluster algorithm aims to detect different cell types and identify cell-type-specific signature genes in single-cell RNA sequencing data.

First, we need to specify the parameter arguments that we’ll pass to the ClussCluster function. Similar to ClussCluster_Gap, we set the data set to be x = dat.ft, the number of clusters nclust = 3 (one can also set the centers of the clusters instead) and the tuning parameters ws = run.gap$onesd.bestw. See ?ClussCluster for how to specify other optional parameters.

Note: The users can choose any tuning parameter other than run.gap$onesd.bestw. One can also provide multiple tuning parameters to ClussCluster function.

If the users have run ClussCluster_Gap in the previous step, every candidate tuning parameter was already evaluated by ClussCluster function and the results are stored in run.gap$run. Therefore, if the tuning parameter to be used in this step is in the set of candidate tuning parameter, say ws = 3.8, one can either run the ClussCluster function or directly extract the results from the ClussCluster_Gap object:

i<- match(run.gap$onesd.bestw, c(2.4, 3.1, 3.8))
run.cc <- run.gap$run[[i]]
run.cc <- ClussCluster(dat.ft, nclust = 3, ws = run.gap$onesd.bestw)

The two sets of commands will produce the same results.

Viewing Clustering Results

Here we first check the accuracy of the clustering results by examining the confusion matrix of the estimated cluster labels in run.cc and the true labels in Hou$y:

theta <- run.cc$theta
table(Hou_sim$y, theta)

##    theta
##      1  2  3
##   1  0  0 18
##   2  7  0  0
##   3  0  7  1

Here we see that almost all 33 cells were clustered to the correct cell subpopulation with the HCC_subpop_1 cells making up cluster 1, the HCC_subpop_2 cells making up cluster 2 and the HepG2 cells making up cluster 3.

Note that the labels can change arbitrarily with different seeds in the clustering process, it is crucial that the users find the one-to-one correspondence between the cluster labels and the cell types.

We have found in simulation studies that the accuracy of the clustering is closely related to the accuracy of the selected signature genes in the next step. For example, if half of the cells in cluster 1 are HCC_subpop_1 cells but the other half are not, then the signature genes of cluster 1 should not be identified as marker genes of HCC_subpop_1.

Cell-type-specific signature genes

We first apply the print_ClussCluster function to the ClussCluster object to print the number of signature genes for each cluster:

print_ClussCluster(run.cc)

## Tuning parameter is:  2.4
## Number of signature genes of each cluster:  14 12 11
## Clustering:  3 3 3 3 3 1 3 3 3 3 3 3 1 3 1 3 3 3 3 1 1 1 3 3 1 2 3 2 2 2 2 2 2
## Value of objective function:  945.4568

The importance of each gene to each cluster is stored in the weight matrix. The matrix is of dimension p x K, where p is the number of genes and K is the number of clusters. The first column of the weight matrix contains the importance of all genes to HCC_subpop_1 cells, the second column to the HCC_subpop_2 cells, and the third column to the HepG2 cells. The unimportant genes are assigned zero weights whereas important genes are given positive weights.

Here we view the top of the weight matrix:

wt.mat <- run.cc$w
head(wt.mat, 10)

##             [,1]       [,2]       [,3]
##  [1,] 0.06952553 0.08100039 0.00000000
##  [2,] 0.05851836 0.25708289 0.17314387
##  [3,] 0.12997910 0.00000000 0.09069054
##  [4,] 0.14490477 0.00000000 0.05212584
##  [5,] 0.85715295 0.19538397 0.65797477
##  [6,] 0.32643314 0.13778463 0.16979373
##  [7,] 0.01924033 0.10334980 0.00000000
##  [8,] 0.12588686 0.29869048 0.31495780
##  [9,] 0.07957272 0.81501781 0.37367290
## [10,] 0.16872714 0.00000000 0.06070300

To extract the index and the names of the signature genes for each cluster we can run:

sig_index <- apply(wt.mat, 2, function(w) which(w!=0))
sig_names <- apply(wt.mat, 2, function(w) rownames(dat.ft)[which(w!=0)])
sig_names

## [[1]]
##  [1] "EEF1A1"   "APOC3"    "APOH"     "TF"       "APOA1"    "AHSG"     "GNB2L1"  
##  [8] "VTN"      "HP"       "SERPINA3" "RPL4"     "SERPINC1" "RPSA"     "CYB5A"   
## 
## [[2]]
##  [1] "EEF1A1"   "APOC3"    "APOA1"    "AHSG"     "GNB2L1"   "VTN"      "HP"      
##  [8] "RPL7"     "RPL4"     "SERPINC1" "RPL5"     "RPSA"    
## 
## [[3]]
##  [1] "APOC3"    "APOH"     "TF"       "APOA1"    "AHSG"     "VTN"      "HP"      
##  [8] "SERPINA3" "RPL7"     "SERPINC1" "CYB5A"

The top 5 signature genes for each cluster can be identified as:

top_5_genes <- apply(wt.mat, 2, function(w) rownames(dat.ft)[order(w, decreasing = T)[1:5]])
top_5_genes

##      [,1]       [,2]       [,3]      
## [1,] "APOA1"    "HP"       "APOA1"   
## [2,] "AHSG"     "SERPINC1" "SERPINC1"
## [3,] "SERPINC1" "VTN"      "HP"      
## [4,] "SERPINA3" "APOC3"    "VTN"     
## [5,] "TF"       "APOA1"    "APOC3"

6. Plot

Besides the results shown in Sections 4 and 5 in this vignette, this package also provides options to visualize the results of ClussCluster and ClussCluster_Gap. Next we demonstrate the utilities of functions plot_ClussCluster and plot_ClussCluster_Gap to produce four types of plots. The objects we will need are run.gap and run.cc.

With the run.gap object, we apply the plot_ClussCluster_Gap function to plot the gap statistics against the candidate tuning parameters. Their standard deviations are plotted as the vertical bars. It would helpful to users to eyeball the one that maximizes the gap statistic and the one chosen by the one-sd rule:

plot_ClussCluster_Gap(run.gap)

With the run.cc object, there are several ways to apply the plot_ClussCluster function. If multiple candidate tuning parameters are evaluated, the number of signature genes of each cluster are plotted against the tuning parameters, with each color and line type corresponding to one cluster:

plot_ClussCluster(run.gap$run)

If only one tuning parameter is evaluated, then the user can have a closer look at the results. Two plots will be produced: the venn diagram of the cluster-specific signature genes and the heatmap of the data set with the top m genes from each cluster. The venn diagram shows the overlap of the cluster-specific genes so that we know which genes are important to all clusters and the which genes are uniquely important to one cluster. The heatmap shows the expression patterns of the signature genes, they can be over-expressed, under-expressed, or consistent within a particular cluster.

For this vignette, the heatmap is plotted based on top 5 genes:

plot_ClussCluster(run.cc, m = 5, snames=Hou_sim$snames, gnames=rownames(dat.ft))