--- title: "Introduction to SifiNet" author: - "Qi Gao" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{An introduction to the SifiNet package} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} \usepackage[utf8]{inputenc} --- ```{r config, echo = FALSE} knitr::opts_chunk$set( tidy = TRUE, cache = FALSE, dev = "png" ) ``` SifiNet (Single-cell feature identification with Network topology) is an R package of a clustering-independent method for directly identifying feature gene sets. Feature gene set are sets of gene exhibiting unique phenotypes in different cell subpopulations. SifiNet is designed for single cell RNA-seq data, finding sets of feature genes that are differentially expressed in some cell subpopulation. However, it can also find epigenomic feature gene sets when applied to single cell ATAC-seq data. This vignette gives an introduction to SifiNet package. # Installation The development version of SifiNet can be installed from Github: ```{r install-github, eval = FALSE} devtools::install_github("jichunxie/sifinet") ``` # SifiNet method SifiNet method consists of two major sections, A and B. After loading the data, in section A we construct a gene co-expression network with genes as nodes and the co-expressions between genes as edges. And in section B, we identify the feature gene sets based on the gene co-expression network. ## Create SifiNet object Suppose we already have a matrix of count data that we wish to find feature gene sets from. Here is a toy example generated with the `complexSCsimulation` package: ```{r load_data} # Load package library(SiFINeT) # Load data data <- readRDS("toy_matrix.rds") sifi_obj <- create_SiFINeT_object( counts = data, gene.name = NULL, meta.data = NULL, data.name = NULL, sparse = FALSE, rowfeature = TRUE ) ``` The `count` matrix should be a gene (row) by cell (column) matrix by default. Otherwise, for a cell (row) by gene (column) matrix, the `rowfeature` argument need to be set to `FALSE`. Name of the genes could be specified by the `gene.name` argument. Meta data of the cells could be input using the `meta.data` argument. If the input count matrix is sparse (majority of the matrix is 0), we can set the `sparse` argument to be `TRUE`, which would potentially reduce the space and time cost of the method. We can also assign a name to the input data set by `data.name` argument. Multiple count matrix could be saved in a SifiNet object. The name of data set could help to distinguish those inputs. ## Section A: Construct the gene co-expression network ```{r network1} # estimate the quantiles sifi_obj <- quantile_thres(sifi_obj) sifi_obj <- feature_coexp(sifi_obj) ``` In SifiNet, we use quantile association to measure the co-expression between genes. So, to construct the gene co-expression network, we first divide the read counts in the matrix into "low" and "high" groups using quantile regression in `quantile_thres` function. If `sifi_obj@meta.data` has at least one column, then the columns of `sifi_obj@meta.data` are used as independent variables. Otherwise, the mean total number of reads in each cell is used as independent variable for the quantile regression. The results are saved in `sifi_obj@data.thres`. Based on the results of the "low" and "high" group classification, we calculate the co-expression score using `feature_coexp` function. The results are saved in `sifi_obj@coexp`. ```{r network2} # build the co-expression network sifi_obj <- create_network(sifi_obj, alpha = 0.05, manual = FALSE, least_edge_prop = 0.01 ) ``` Then we select a threshold for co-expression score. Gene pairs with absolute value of co-expression score greater than the threshold would be considered to have an edge between them in the network. This procedure is done by function `create_network`. By default, an FDR control procedure is applied to select the threshold. Argument `alpha` is the type I error of the FDR control procedure. After finding the threshold using the FDR control procedure, SifiNet also allows an additional modification of the threshold. In case that the threshold is overly strict, we can set a lower bound for the proportion of edges (total number of edges divided by the total number of different gene pairs) with argument `least_edge_prop`. Note the `least_edge_prop` is only used when argument `manual` is `TRUE`. The results are saved in `sifi_obj@est_ms` and `sifi_obj@thres`. ```{r network3} # filter gene to improve the quality of the co-expression network sifi_obj <- filter_lowexp(sifi_obj, t1 = 10, t2 = 0.9, t3 = 0.9) ``` After building the co-expression network, we further perform additional gene filtering steps to improve the quality of the co- expression network. Greater values of arguments `t1`, `t2`, and `t3` would lead to a less strict filtering, keeping more genes in the network. The results are saved in `sifi_obj@kset`. ## Section B: Calculate gene connectivity and find feature gene sets ```{r feature1} # calculate connectivity for each gene sifi_obj <- cal_connectivity(sifi_obj, m = 10, niter = 100) ``` In section B, based on the topology structure of gene co-expression network, we identify the feature gene sets. Using function `cal_connectivity`, we calculate the 1st, 2nd, and 3rd order connectivities for each gene. The connectivities are measures of the topology structure of gene co-expression network. Greater values of argument `m` and `niter` would improve the accuracy of 3rd order connectivity, but would also increase the computation time. The results are saved in `sifi_obj@conn`. ```{r feature2} # detect core feature gene sets sifi_obj <- find_unique_feature(sifi_obj, t1 = 5, t2 = 0.4, t3 = 0.3, t1p = 5, t2p = 0.7, t3p = 0.5, resolution = 1, min_set_size = 5 ) ``` After calculating the connectivities, we can detect candidate feature genes. And then we identify the core feature genes in each of the feature gene sets among the candidate feature genes. Core feature genes are genes that show unique phenotypes in only one cell subpopulation (not shared with other cell subpopulation). These steps are done by function `find_unique_feature`. Arguments `t1`, `t2`, `t3`, `t1p`, `t2p`, `t3p` are thresholds for 1st, 2nd, and 3rd order connectivities. Greater input values of those arguments would lead to less number of feature gene sets and feature genes. `resolution` argument is used for `cluster_louvain` function in `igraph` package. We require a feature gene set to have at least `min_set_size` core feature genes to be detected. The results are saved in `sifi_obj@conn2`, `sifi_obj@fg_id`, `sifi_obj@uni_fg_id`, `sifi_obj@uni_cluster`, `sifi_obj@selected_cluster` and `sifi_obj@featureset`. ```{r feature3} # assign shared feature genes to core feature gene sets sifi_obj <- assign_shared_feature(sifi_obj, min_edge_prop = 0.5) ``` Then we assign other candidate feature genes (that are not identified as core feature gene) into core feature gene sets. Those genes are allowed to be shared by more than 1 core feature gene sets. The assignment of shared feature gene is based on the proportion of edge between a core feature gene set and the feature gene (number of edges between the feature gene and genes in the core feature gene set divided by number of genes in the core feature gene set). An assignment with greater value of argument `min_edge_prop` would have less shared feature gene added. The results are saved in `sifi_obj@featureset`. ```{r feature4} # enrich feature gene sets with other genes sifi_obj <- enrich_feature_set(sifi_obj, min_edge_prop = 0.9) ``` In the last step, we enrich the feature gene sets (with both core and shared feature gene) with other genes that are not identified previously as feature gene. Those genes are also allowed to be shared by more than 1 core feature gene sets. Similarly, the assignment of enriched feature gene is based on the proportion of edge between a feature gene set and the gene (number of edges between the gene and genes in the feature gene set divided by number of genes in the feature gene set). An assignment with greater value of argument `min_edge_prop` would have less shared feature gene added. The results are saved in `sifi_obj@featureset`. # Session information ```{r sessionInfo} sessionInfo() ```