Handling the input matrix in CB2

Hyun-Hwan Jeong

2026-01-29

If an analysis starts with an input matrix, it has to be appropriately pre-proceed before it is used any functions of CB2. CB2 allows two different types of input: a numeric matrix/data frame with row.names and a data.frame contains columns of counts and columns of sgRNA IDs and target genes. Either of them will work. This document explains how the input should be formed and how to process the input using CB2. In the entire document, (Evers et al. 2016)’s CRISPR-RT112 screen data are used.

The following code imports required packages which are required to run below codes.

library(CB2)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(readr)

## Warning: package 'readr' was built under R version 4.4.3

The following code block shows an example of the first type of input which CB2 can handle. Each column of Evers_CRISPRn_RT112$count contains counts of guide RNAs of a sample (that was initially extracted from NGS data). A count of the input shows that how many guide RNA barcodes were observed from a given NGS sample. Each row of the matrix has a row name (e.g., RPS19_sg10), and the name is the ID of a guide RNA. For example, RPS19_sg10, which is the first-row name in the example, indicates that the first row contains the counts of RPS_sg10 guide RNA. Every guide RNA ID must have exactly one _ character, and it is used to be a separator of two strings. The first string displays the name of a gene whose gene is targeted by the guide RNA, and the second string is used as an identifier among guide RNAs that targets the same gene. For example, RPS_sg10 indicates that the guide RNA is designed to target the RPS gene, and sg10 is the unique identifier.

NOTE : If the input contains multiple _ characters, CB2 is not able to run. In particular, if Entrez gene IDs are used as the gene names, CB2 does not handle the input. One of the solutions for this case is changing the gene names to another identifier (e.g., HGNC symbol) or using another type of input, which will explain below.

data("Evers_CRISPRn_RT112")
head(Evers_CRISPRn_RT112$count)

##               B1   B2   B3   A1   A2   A3
## RPS19_sg10 10180 9768 9406 1005 1186  911
## NUP93_sg4   9073 8598 8363  688  479  581
## PSMD11_sg2  9408 9573 9384 1014 1196  797
## RPS3A_sg3   8922 8965 8779 1266 1303 1176
## PSMB3_sg3   7434 6958 6871  450  628  398
## POLR2A_sg5  7672 7537 7259  647  786  683

In addition, CB2 requires experiment design information which is formed as a data.frame and contains sample names and groups of each sample. In Evers_CRISPRn_RT112 data, Evers_CRISPRn_RT112$design is the data.frame.

Evers_CRISPRn_RT112$design

## # A tibble: 6 × 2
##   group  sample_name
##   <chr>  <chr>      
## 1 before B1         
## 2 before B2         
## 3 before B3         
## 4 after  A1         
## 5 after  A2         
## 6 after  A3

With the two variables, CB2 can perform the hypothesis test with measure_sgrna_stats and measure_gene_stats functions.

sgrna_stats <- measure_sgrna_stats(Evers_CRISPRn_RT112$count, Evers_CRISPRn_RT112$design, "before", "after")
gene_stats <- measure_gene_stats(sgrna_stats)
head(gene_stats)

## # A tibble: 6 × 11
##   gene  n_sgrna cpm_a cpm_b  logFC     p_ts     p_pa     p_pb   fdr_ts   fdr_pa
##   <chr>   <int> <dbl> <dbl>  <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
## 1 ABCG8      10 1101. 1540.  0.481 1.03e-18 1   e+ 0 2.41e-21 1.56e-18 1   e+ 0
## 2 ADH7       10  799. 1173.  0.522 4.88e-20 1   e+ 0 1.10e-22 8.55e-20 1   e+ 0
## 3 CABP5      10 1171. 1662.  0.493 2.10e-21 1   e+ 0 4.55e-24 4.64e-21 1   e+ 0
## 4 COPA       10  860.  413. -1.51  4.41e-33 5.48e-33 8.12e- 1 1.37e-31 1.70e-31
## 5 COPB1      10 1054.  498. -1.57  3.29e-25 3.04e-24 5.31e- 1 1.26e-24 1.29e-23
## 6 COPS2      10 1055.  842. -0.728 6.56e-19 4.85e-12 1.43e- 4 1.03e-18 1.13e-11
## # ℹ 1 more variable: fdr_pb <dbl>

Another input type is a data.frame that contains two additional columns, which contain the guide RNA information (target gene and guide RNA identifier). A CSV file which was used in the CB2 publication ((Jeong et al. 2019)). The CSV file contains the additional columns, the first is the gene column, and the second is the sgRNA column.

df <- read_csv("https://raw.githubusercontent.com/hyunhwaj/CB2-Experiments/master/01_gene-level-analysis/data/Evers/CRISPRn-RT112.csv")

## Rows: 961 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): gene, sgRNA
## dbl (6): B1, B2, B3, A1, A2, A3
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

df

## # A tibble: 961 × 8
##    gene   sgRNA       B1    B2    B3    A1    A2    A3
##    <chr>  <chr>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 RPS19  RPS19-10 10180  9768  9406  1005  1186   911
##  2 NUP93  NUP93-4   9073  8598  8363   688   479   581
##  3 PSMD11 PSMD11-2  9408  9573  9384  1014  1196   797
##  4 RPS3A  RPS3A-3   8922  8965  8779  1266  1303  1176
##  5 PSMB3  PSMB3-3   7434  6958  6871   450   628   398
##  6 POLR2A POLR2A-5  7672  7537  7259   647   786   683
##  7 RPS8   RPS8-8    6999  6966  6775   673   665   474
##  8 RPL36  RPL36-3   5988  5893  5498   266   276   201
##  9 RPS13  RPS13-4   5763  5546  5131   187   191   196
## 10 RPS11  RPS11-2   9444  9621  8835  1900  2213  1705
## # ℹ 951 more rows

Two additional parameters have to be set to the measure_sgrna_stats function if an input matrix is this type. The first parameter is ge_id, which specifies the column of genes, and the second parameter is sg_id, which specifies the column of IDs. In the following code, gene_id sets as gene and sg_id sets as sgRNA.

head(measure_sgrna_stats(df, Evers_CRISPRn_RT112$design, "before", "after", ge_id = 'gene', sg_id = 'sgRNA'))

##     gene    sgRNA n_a n_b      phat_a       vhat_a       phat_b       vhat_b
## 1  RPS19 RPS19-10   3   3 0.002323257 5.369646e-10 0.0003257504 4.174644e-10
## 2  NUP93  NUP93-4   3   3 0.002060560 7.828460e-10 0.0001838979 3.300861e-10
## 3 PSMD11 PSMD11-2   3   3 0.002244954 1.772784e-10 0.0003147378 8.295566e-10
## 4  RPS3A  RPS3A-3   3   3 0.002110486 1.666823e-10 0.0003935102 4.133227e-11
## 5  PSMB3  PSMB3-3   3   3 0.001683100 8.591075e-10 0.0001547065 3.831946e-10
## 6 POLR2A POLR2A-5   3   3 0.001778234 1.404884e-10 0.0002229180 2.109228e-10
##      cpm_a    cpm_b     logFC    t_value       df         p_ts         p_pa
## 1 2323.275 325.7290 -2.830614  -64.65714 3.938262 4.148366e-07 2.074183e-07
## 2 2060.586 183.9145 -3.478824  -56.25377 3.432003 3.271576e-06 1.635788e-06
## 3 2246.609 314.6832 -2.831842  -60.83125 2.817477 1.761248e-05 8.806242e-06
## 4 2111.801 393.9205 -2.419522 -119.04667 2.934424 1.685127e-06 8.425636e-07
## 5 1683.145 154.6872 -3.435294  -43.36323 3.488096 6.833235e-06 3.416617e-06
## 6 1778.610 222.9897 -2.990057  -82.96805 3.845513 2.121616e-07 1.060808e-07
##        p_pb       fdr_ts       fdr_pa    fdr_pb
## 1 0.9999998 4.714939e-05 2.357470e-05 0.9999999
## 2 0.9999984 6.164676e-05 3.273760e-05 0.9999999
## 3 0.9999912 1.071240e-04 7.693454e-05 0.9999999
## 4 0.9999992 5.772537e-05 2.886268e-05 0.9999999
## 5 0.9999966 8.042437e-05 4.512440e-05 0.9999999
## 6 0.9999999 4.714939e-05 2.357470e-05 0.9999999

References

Evers, Bastiaan, Katarzyna Jastrzebski, Jeroen PM Heijmans, Wipawadee Grernrum, Roderick L Beijersbergen, and Rene Bernards. 2016. “CRISPR Knockout Screening Outperforms shRNA and CRISPRi in Identifying Essential Genes.” Nature Biotechnology 34 (6): 631.

Jeong, Hyun-Hwan, Seon Young Kim, Maxime WC Rousseaux, Huda Y Zoghbi, and Zhandong Liu. 2019. “Beta-Binomial Modeling of CRISPR Pooled Screen Data Identifies Target Genes with Greater Sensitivity and Fewer False Negatives.” Genome Research 29 (6): 999–1008.