Introduction to cinaR

Quick Start
Creating different contrasts
Comparison scheme
Running for bulk RNA-seq data
Running enrichment with different dataset
Selecting different reference genomes
Batch Effect Correction
Adding extra covariates
Saving DA peaks to excel
Using different GLM algorithms
Some useful arguments
References
Session Info

Quick Start

library(cinaR)
data("atac_seq_consensus_bm")

Bed formatted consensus matrix (chr, start, end and samples)

dim(bed)

## [1] 1000   25

# bed formatted file
head(bed[,1:4])

##         Chr     Start       End B6-18mo-M-BM-47-GT18-01783
## 52834  chr5  24841478  24845196                       1592
## 29780 chr17   8162955   8164380                        109
## 67290  chr8  40577584  40578029                         72
## 51295  chr4 145277698 145278483                        110
## 4267   chr1 180808752 180815472                       2452
## 45102  chr3  88732151  88732652                         49

Create the contrasts you want to compare, here we create contrasts for 22 mice samples from different strains.

# create contrast vector which will be compared.
contrasts<- c("B6", "B6", "B6", "B6", "B6", "NZO", "NZO", "NZO", "NZO", "NZO", "NZO", 
              "B6", "B6", "B6", "B6", "B6", "NZO", "NZO", "NZO", "NZO", "NZO", "NZO")

cinaR function directly computes the differentially accessible peaks.

# If reference genome is not set hg38 will be used!
results <- cinaR(bed, contrasts, reference.genome = "mm10")

## >> preparing features information...      2022-05-17 10:45:12 
## >> identifying nearest features...        2022-05-17 10:45:12 
## >> calculating distance from peak to TSS...   2022-05-17 10:45:13 
## >> assigning genomic annotation...        2022-05-17 10:45:13 
## >> assigning chromosome lengths           2022-05-17 10:45:27 
## >> done...                    2022-05-17 10:45:27

Now, you can access differential accessibility (DA) and enrichment results.

names(results)

## [1] "DA.results"         "Enrichment.Results"

Inside DA.results, you have the consensus peaks (cp) and differentially accessible (DA) peaks. If batch correction was run, then cp will be a batch-corrected consensus matrix, otherwise it is the filtered and normalized version of original consensus peaks you provided.

names(results$DA.results)

## [1] "cp"       "DA.peaks"

There are many information cinaR provides such as adjusted p value, log fold-changes, gene names etc for each peak:

colnames(results$DA.results$DA.peaks$B6_NZO)

##  [1] "Row.names"     "seqnames"      "start"         "end"          
##  [5] "width"         "strand"        "annotation"    "geneChr"      
##  [9] "geneStart"     "geneEnd"       "geneLength"    "geneStrand"   
## [13] "geneId"        "transcriptId"  "distanceToTSS" "gene_name"    
## [17] "logFC"         "FDR"

Here is an overview of those DA peaks:

head(results$DA.results$DA.peaks$B6_NZO[,1:5])

##                   Row.names seqnames     start       end width
## 1 chr10_105840598_105842176    chr10 105840598 105842176  1579
## 2   chr10_59950325_59952673    chr10  59950325  59952673  2349
## 3   chr10_63176490_63176839    chr10  63176490  63176839   350
## 4   chr10_77220928_77221910    chr10  77220928  77221910   983
## 5   chr10_79751429_79751786    chr10  79751429  79751786   358
## 6   chr10_86021157_86023861    chr10  86021157  86023861  2705

Since the comparison is B6_NZO, if fold-changes are positive it means they are more accesible in B6 compared to NZO and vice versa for negative values!

and here is a little overview for enrichment analyses results:

head(results$Enrichment.Results$B6_NZO[,c("module.name","overlapping.genes", "adj.p")])

##                 module.name                   overlapping.genes      adj.p
## 1         Myeloid lineage 1 TFEB,FBXL5,PLXNC1,GM2A,AGTPBP1,CTSB 0.05914491
## 2  U_metabolism/replication             SLC2A6,GM2A,CTSB,PECAM1 0.05914491
## 3  U_mitochondrial proteins     PIK3R1,PAQR3,UBE3A,MAP4K4,PTPRC 0.32816305
## 4 U_proteasome/ubiquitin cx                  PIK3R1,IREB2,PTPRC 0.39112517
## 5   U_Immunity/cytoskeleton                          RPS6,RPS19 0.66488512
## 6         Myeloid lineage 2                        RNF157,MTUS1 0.66488512

PCA Plots

You can easily get the PCA plots of the samples:

pca_plot(results, contrasts, show.names = F)

You can overlay different information onto PCA plots as well!

# Overlaid information
overlaid.info <- c("B6-18mo", "B6-18mo", "B6-18mo", "B6-18mo", "B6-18mo", 
                   "NZO-18mo", "NZO-18mo", "NZO-18mo", "NZO-18mo", "NZO-18mo", "NZO-18mo", 
                   "B6-3mo", "B6-3mo", "B6-3mo", "B6-3mo", "B6-3mo", 
                   "NZO-3mo", "NZO-3mo", "NZO-3mo", "NZO-3mo", "NZO-3mo", "NZO-3mo")
# Sample IDs
sample.names <- c("S01783", "S01780", "S01781", "S01778", "S01779", 
"S03804", "S03805", "S03806", "S03807", "S03808", 
"S03809", "S04678", "S04679", "S04680", "S04681", 
"S04682", "S10918", "S10916", "S10919", "S10921", 
"S10917", "S10920")

pca_plot(results, overlaid.info, sample.names)

## Warning: ggrepel: 7 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

Heatmaps

Differential peaks

You can see the available comparisons using:

show_comparisons(results)

## [1] "B6_NZO"

Then, plot the differential peaks for a selected contrast using:

heatmap_differential(results, comparison = "B6_NZO")

Also, you can configure your heatmaps using the additional arguments of pheatmap function. For more information check out ?pheatmap.

heatmap_differential(results, comparison = "B6_NZO", show_colnames = FALSE)

Most variable peaks

You can also plot most variable 100 peaks for all samples:

heatmap_var_peaks(results)

Plus, you can set the number of peaks to be used in these plots, and again you can change the additional arguments of pheatmap function. For more information check out ?pheatmap.

heatmap_var_peaks(results, heatmap.peak.count = 200, cluster_cols = F)

Enrichment Plots

You can plot your enrichment results using:

dot_plot(results)

## Warning: Removed 54 rows containing missing values (geom_point).

if it gets too crowded, you can get rid of the irrelevant pathways as well:

dot_plot(results, filter.pathways = T)

Creating different contrasts

Note that you can further divide the resolution of contrasts, for instance this is also a valid vector

contrasts <- sapply(strsplit(colnames(bed), split = "-", fixed = T), 
                    function(x){paste(x[1:4], collapse = ".")})[4:25]
unique(contrasts)

## [1] "B6.18mo.M.BM"  "B6.18mo.F.BM"  "NZO.18mo.F.BM" "NZO.18mo.M.BM"
## [5] "B6.3mo.F.BM"   "B6.3mo.M.BM"   "NZO.3mo.F.BM"  "NZO.3mo.M.BM"

in this case, each of them will be compared to each other which will result in 28 different comparisons.

Comparison scheme

As default, cinaR will use one vs one (OVA) scheme, comparing each contrast to others one by one. However, this can be changed to one vs all (OVA) which will compare each contrast to rest:

# one-vs-one (results in n choose k comparisons, default)
cinaR(..., comparison.scheme = "OVO")

# one-vs-all (results in n comparisons)
cinaR(..., comparison.scheme = "OVA")

Running for bulk RNA-seq data

To run cinaR with RNA-seq experiments, just provide the count matrix, and specify the experiment type:

cinaR(matrix = count.matrix, ..., experiment.type = "RNA-Seq")

Note that, count.matrix should be in the form of $g \times (1+n)$ where $g$ is the number of genes and $n$ is the number of samples, and plus one is for gene names.

Note that currently cinaR can only handle gene symbols (e.g. FOSL2, FOXA) and ensembl stable IDs (e.g. ENSG00000010404) for both human and mice!

Running enrichment with different dataset

You can run the enrichment analyses with a custom gene set:

cinaR(..., geneset = new_geneset)

`cinaRgenesets`

Easiest way to do this is to use cinaRgenesets package. You can select your gene set of interest and just plug it into your pipeline.

MSigDB

You can also download different gene sets from MSigDB site. Note that you should use the human gene symbol versions.

You can use read.gmt function from qusage package to read these genesets into your current environment.

Custom gene sets

A geneset must be a .gmt formatted symbol file.
You can familiarize yourself with the format by checking out :

# default geneset to be used
data("VP2008")

If you have gene and pathway names in a data.frame, you can use split function to create your own .gmt formatted gene sets e.g. split(df$genes, df$pathways).

Selecting different reference genomes

For now, cinaR supports 3 genomes for human and mice models:

hg38
hg19
mm10

You can set your it using reference.genome argument.

Batch Effect Correction

If you suspect your data have unknown batch effects, you can use:

cinaR(..., batch.correction = T)

This option will run Surrogate Variable Analysis (SVA) and try to adjust your data for unknown batch effects. If however, you already know the batches of the samples, you can simply set the batch.information argument as well. This will not run the SVA but just add the batches to design matrix.

# runs SVA
cinaR(..., batch.correction = T)

# runs SVA with 2 surrogate variables
cinaR(..., batch.correction = T, sv.number = 2)

# adds only batch information to the design matrix! (does not run SVA)
# batch.information should be number a vector where
# the length of it equals to the number of samples.
cinaR(..., batch.correction = T, batch.information = c(rep(0, 11), rep(1,11)))

Reminder - In our example data we have 22 samples

Adding extra covariates

Sometimes, one might want to add additional covariates to adjust the design matrix further, which affects the down-stream analyses. For instance, ages or sexes of the samples could be additional covariates. To account for those:

# Ages of the samples could be not in biological interests but should be accounted for!
cinaR(..., additional.covariates = c((18, 11), (3, 11)))

# More than one covariate for instance, sex and age
sex.info <- c("M", "F", "M", "F", "F", "F", "F", "F", "M", "M", "M", 
              "F", "F", "M", "M", "M", "F", "F", "M", "M", "F", "M")

age.info <- c((18, 11), (3, 11)
covs <- data.frame(Sex = sex.info, Age = age.info)

cinaR(..., additional.covariates = covs)

Saving DA peaks to excel

Setting save.DA.peaks = TRUE in cinaR function will create a DApeaks.xlsx file in the current directory. This file includes all the comparisons in different tabs. Additionally, you can set the path/name of the file using DA.peaks.path argument after setting save.DA.peaks = TRUE.

For instance,

results <- cinaR(..., save.DA.peaks = T, DA.peaks.path = "./Peaks_mice.xlsx")

will create an excel file with name Peaks_mice.xlsx in the current directory.

Using different GLM algorithms

Currently, cinaR supports 4 different algorithms, namely;

edgeR
limma-voom
limma-trend
DESeq2

If not set, it uses edgeR for differential analyses. You can change the used algorithm by simply setting DA.choice argument. For more information, ?cinaR

Some useful arguments

# new FDR threshold for DA peaks
results <- cinaR(..., DA.fdr.threshold = 0.1)

# filters out pathways
results <- cinaR(..., enrichment.FDR.cutoff = 0.1)

# does not run enrichment pipeline
results <- cinaR(..., run.enrichment = FALSE)

# creates the piechart from chIpSeeker package
results <- cinaR(..., show.annotation.pie = TRUE)

# change cut-off value for dot plots
dot_plot(..., fdr.cutoff = 0.05)

References

Robinson MD, McCarthy DJ, Smyth GK (2010). “edgeR: a Bioconductor package for differential expression analysis of digital gene expression data.” Bioinformatics, 26(1), 139-140. doi: 10.1093/bioinformatics/btp616.
Ritchie ME, Phipson B, Wu D, Hu Y, Law CW, Shi W, Smyth GK (2015). “limma powers differential expression analyses for RNA-sequencing and microarray studies.” Nucleic Acids Research, 43(7), e47.
Love, M.I., Huber, W., Anders, S. (2014) Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology, 15:550. 10.1186/s13059-014-0550-8

Session Info

sessionInfo()

## R version 4.1.0 (2021-05-18)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur 10.16
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] C/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] cinaR_0.2.3
## 
## loaded via a namespace (and not attached):
##   [1] shadowtext_0.0.9                         
##   [2] ChIPseeker_1.28.3                        
##   [3] fastmatch_1.1-3                          
##   [4] BiocFileCache_2.0.0                      
##   [5] plyr_1.8.6                               
##   [6] igraph_1.2.9                             
##   [7] lazyeval_0.2.2                           
##   [8] splines_4.1.0                            
##   [9] BiocParallel_1.26.2                      
##  [10] GenomeInfoDb_1.28.4                      
##  [11] ggplot2_3.3.5                            
##  [12] digest_0.6.29                            
##  [13] yulab.utils_0.0.4                        
##  [14] htmltools_0.5.2                          
##  [15] GOSemSim_2.18.1                          
##  [16] viridis_0.6.2                            
##  [17] GO.db_3.13.0                             
##  [18] fansi_0.5.0                              
##  [19] magrittr_2.0.1                           
##  [20] memoise_2.0.1                            
##  [21] limma_3.48.3                             
##  [22] Biostrings_2.60.2                        
##  [23] graphlayouts_0.7.2                       
##  [24] matrixStats_0.61.0                       
##  [25] enrichplot_1.12.3                        
##  [26] prettyunits_1.1.1                        
##  [27] colorspace_2.0-2                         
##  [28] blob_1.2.2                               
##  [29] rappdirs_0.3.3                           
##  [30] ggrepel_0.9.1                            
##  [31] xfun_0.29                                
##  [32] dplyr_1.0.7                              
##  [33] crayon_1.4.2                             
##  [34] RCurl_1.98-1.5                           
##  [35] jsonlite_1.7.2                           
##  [36] scatterpie_0.1.7                         
##  [37] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2  
##  [38] ape_5.5                                  
##  [39] glue_1.5.1                               
##  [40] polyclip_1.10-0                          
##  [41] gtable_0.3.0                             
##  [42] zlibbioc_1.38.0                          
##  [43] XVector_0.32.0                           
##  [44] DelayedArray_0.18.0                      
##  [45] BiocGenerics_0.38.0                      
##  [46] scales_1.1.1                             
##  [47] DOSE_3.18.3                              
##  [48] pheatmap_1.0.12                          
##  [49] DBI_1.1.1                                
##  [50] edgeR_3.34.1                             
##  [51] Rcpp_1.0.7                               
##  [52] plotrix_3.8-2                            
##  [53] viridisLite_0.4.0                        
##  [54] progress_1.2.2                           
##  [55] tidytree_0.3.6                           
##  [56] gridGraphics_0.5-1                       
##  [57] bit_4.0.4                                
##  [58] stats4_4.1.0                             
##  [59] httr_1.4.2                               
##  [60] fgsea_1.18.0                             
##  [61] gplots_3.1.1                             
##  [62] RColorBrewer_1.1-2                       
##  [63] ellipsis_0.3.2                           
##  [64] pkgconfig_2.0.3                          
##  [65] XML_3.99-0.8                             
##  [66] farver_2.1.0                             
##  [67] sass_0.4.0                               
##  [68] dbplyr_2.1.1                             
##  [69] locfit_1.5-9.4                           
##  [70] utf8_1.2.2                               
##  [71] labeling_0.4.2                           
##  [72] ggplotify_0.1.0                          
##  [73] tidyselect_1.1.1                         
##  [74] rlang_1.0.2                              
##  [75] reshape2_1.4.4                           
##  [76] AnnotationDbi_1.54.1                     
##  [77] munsell_0.5.0                            
##  [78] tools_4.1.0                              
##  [79] cachem_1.0.6                             
##  [80] cli_3.1.0                                
##  [81] generics_0.1.1                           
##  [82] RSQLite_2.2.9                            
##  [83] evaluate_0.14                            
##  [84] stringr_1.4.0                            
##  [85] fastmap_1.1.0                            
##  [86] yaml_2.2.1                               
##  [87] ggtree_3.0.4                             
##  [88] knitr_1.36                               
##  [89] bit64_4.0.5                              
##  [90] tidygraph_1.2.0                          
##  [91] caTools_1.18.2                           
##  [92] purrr_0.3.4                              
##  [93] KEGGREST_1.32.0                          
##  [94] ggraph_2.0.5                             
##  [95] nlme_3.1-153                             
##  [96] aplot_0.1.1                              
##  [97] DO.db_2.9                                
##  [98] xml2_1.3.3                               
##  [99] biomaRt_2.48.3                           
## [100] compiler_4.1.0                           
## [101] rstudioapi_0.13                          
## [102] filelock_1.0.2                           
## [103] curl_4.3.2                               
## [104] png_0.1-7                                
## [105] treeio_1.16.2                            
## [106] tibble_3.1.6                             
## [107] tweenr_1.0.2                             
## [108] bslib_0.3.1                              
## [109] stringi_1.7.6                            
## [110] highr_0.9                                
## [111] GenomicFeatures_1.44.2                   
## [112] lattice_0.20-45                          
## [113] Matrix_1.4-0                             
## [114] vctrs_0.3.8                              
## [115] pillar_1.6.4                             
## [116] lifecycle_1.0.1                          
## [117] jquerylib_0.1.4                          
## [118] data.table_1.14.2                        
## [119] cowplot_1.1.1                            
## [120] bitops_1.0-7                             
## [121] TxDb.Mmusculus.UCSC.mm10.knownGene_3.10.0
## [122] patchwork_1.1.1                          
## [123] rtracklayer_1.52.1                       
## [124] GenomicRanges_1.44.0                     
## [125] qvalue_2.24.0                            
## [126] R6_2.5.1                                 
## [127] BiocIO_1.2.0                             
## [128] KernSmooth_2.23-20                       
## [129] gridExtra_2.3                            
## [130] IRanges_2.26.0                           
## [131] gtools_3.9.2                             
## [132] boot_1.3-28                              
## [133] MASS_7.3-54                              
## [134] assertthat_0.2.1                         
## [135] SummarizedExperiment_1.22.0              
## [136] rjson_0.2.20                             
## [137] GenomicAlignments_1.28.0                 
## [138] Rsamtools_2.8.0                          
## [139] S4Vectors_0.30.2                         
## [140] GenomeInfoDbData_1.2.6                   
## [141] parallel_4.1.0                           
## [142] hms_1.1.1                                
## [143] grid_4.1.0                               
## [144] ggfun_0.0.4                              
## [145] tidyr_1.1.4                              
## [146] rmarkdown_2.11                           
## [147] MatrixGenerics_1.4.3                     
## [148] ggforce_0.3.3                            
## [149] Biobase_2.52.0                           
## [150] restfulr_0.0.13