Introduction to hacksig

Hacksig is a collection of cancer transcriptomics gene signatures as well as a simple and tidy interface to compute single sample enrichment scores.

This document will show you how to getting started with hacksig, but first, we must load the following packages:

Get implemented signatures

In order to get a complete list of the implemented signatures, you can use get_sig_info(). It returns a tibble with very useful information:

the signature_id;
a string of keywords associated to a signature (separated by the “pipe” | symbol);
the publication_doi linking to the original publication;
a brief description.

get_sig_info()
#> # A tibble: 23 × 4
#>   signature_id       signature_keywords              publication_doi description
#>   <chr>              <chr>                           <chr>           <chr>      
#> 1 ayers2017_immexp   ayers2017_immexp|immune expand… 10.1172/JCI911… Immune exp…
#> 2 bai2019_immune     bai2019_immune|head and neck|h… 10.1155/2019/3… Immune/inf…
#> 3 cinsarc            cinsarc|metastasis|sarcoma|sts  10.1038/nm.2174 Biomarker …
#> 4 dececco2014_int172 dececco2014_int172|head and ne… 10.1093/annonc… Signature …
#> 5 eschrich2009_rsi   eschrich2009_rsi|radioresistan… 10.1016/j.ijro… Genes aime…
#> # … with 18 more rows

To get a full view of the tibble, use:

get_sig_info() %>% print(n = Inf)
# or
View(get_sig_info())

If you want to get the list of gene symbols for one or more of the implemented signatures, then use get_sig_genes() with valid keywords:

get_sig_genes("ifng")
#> $muro2016_ifng
#> [1] "CXCL10"  "CXCL9"   "HLA-DRA" "IDO1"    "IFNG"    "STAT1"

Check your signatures

The first thing you should do before computing scores for a signature is to check how many of its genes are present in your data. To accomplish this, we can use check_sig() on a normalized gene expression matrix (either microarray or RNA-seq normalized data), which must be formatted as an object of class matrix or data.frame with gene symbols as row names and sample IDs as column names.

For this tutorial, we will use test_expr (an R object included in hacksig) as an example gene expression matrix with 20 simulated samples.

By default, check_sig() will compute statistics for every signature implemented in hacksig.

check_sig(test_expr)
#> # A tibble: 23 × 5
#>   signature_id     n_genes n_present frac_present missing_genes
#>   <chr>              <int>     <int>        <dbl> <named list> 
#> 1 muro2016_ifng          6         4        0.667 <chr [2]>    
#> 2 wu2020_metabolic      30        20        0.667 <chr [10]>   
#> 3 liu2020_immune         6         4        0.667 <chr [2]>    
#> 4 liu2021_mgs            6         4        0.667 <chr [2]>    
#> 5 estimate_stromal     141        91        0.645 <chr [50]>   
#> # … with 18 more rows

You can filter for specific signatures by entering keywords in the signatures argument (partial match and regular expressions will work):

check_sig(test_expr, signatures = c("metab", "cinsarc"))
#> # A tibble: 2 × 5
#>   signature_id     n_genes n_present frac_present missing_genes
#>   <chr>              <int>     <int>        <dbl> <named list> 
#> 1 wu2020_metabolic      30        20        0.667 <chr [10]>   
#> 2 cinsarc               67        40        0.597 <chr [27]>

We can also check for signatures not implemented in hacksig, that is custom signatures. For example, we can use the msigdbr package to download the Hallmark gene set collection as a tibble and transform it into a list:

hallmark_list <- msigdbr(species = "Homo sapiens", category = "H") %>%
    distinct(gs_name, gene_symbol) %>%
    nest(genes = c(gene_symbol)) %>%
    mutate(genes = map(genes, compose(as_vector, unname))) %>%
    deframe()

check_sig(test_expr, hallmark_list)
#> # A tibble: 50 × 5
#>   signature_id                      n_genes n_present frac_present missing_genes
#>   <chr>                               <int>     <int>        <dbl> <named list> 
#> 1 HALLMARK_WNT_BETA_CATENIN_SIGNAL…      42        27        0.643 <chr [15]>   
#> 2 HALLMARK_APICAL_SURFACE                44        28        0.636 <chr [16]>   
#> 3 HALLMARK_BILE_ACID_METABOLISM         112        70        0.625 <chr [42]>   
#> 4 HALLMARK_NOTCH_SIGNALING               32        20        0.625 <chr [12]>   
#> 5 HALLMARK_PI3K_AKT_MTOR_SIGNALING      105        65        0.619 <chr [40]>   
#> # … with 45 more rows

Missing genes for the HALLMARK_NOTCH_SIGNALING gene set are:

check_sig(test_expr, hallmark_list) %>% 
    filter(signature_id == "HALLMARK_NOTCH_SIGNALING") %>% 
    pull(missing_genes)
#> $HALLMARK_NOTCH_SIGNALING
#>  [1] "FZD5"    "HEYL"    "KAT2A"   "MAML2"   "NOTCH1"  "NOTCH3"  "PPARD"  
#>  [8] "PRKCA"   "PSEN2"   "SAP30"   "ST3GAL6" "TCF7L2"

Compute single sample scores

hack_sig

The main function of the package, hack_sig(), permits to obtain single sample scores from gene signatures. By default, it will compute scores for all the signatures implemented in the package with the original publication method.

hack_sig(test_expr)
#> Warning: ℹ No genes are present in 'expr_data' for the following signatures:
#> x rooney2015_cyt
#> ℹ To obtain CINSARC, ESTIMATE and Immunophenoscore with the original procedures, see:
#> ?hack_cinsarc
#> ?hack_estimate
#> ?hack_immunophenoscore
#> # A tibble: 20 × 16
#>   sample_id ayers2017_immexp bai2019_immune dececco2014_int172 eschrich2009_rsi
#>   <chr>                <dbl>          <dbl>              <dbl>            <dbl>
#> 1 sample1               5.71          -22.3               2.62           0.0289
#> 2 sample2               5.88          -29.9               2.17           0.565 
#> 3 sample3               8.12          -27.5               2.25           0.456 
#> 4 sample4               7.82          -41.0               2.00           0.479 
#> 5 sample5               7.88          -29.9               2.51           0.355 
#> # … with 15 more rows, and 11 more variables: eustace2013_hypoxia <dbl>,
#> #   fang2021_irgs <dbl>, hu2021_derbp <dbl>, li2021_irgs <dbl>,
#> #   liu2020_immune <dbl>, liu2021_mgs <dbl>, lohavanichbutr2013_hpvneg <dbl>,
#> #   muro2016_ifng <dbl>, qiang2021_irgs <dbl>, she2020_irgs <dbl>,
#> #   wu2020_metabolic <dbl>

You can also filter for specific signatures (e.g. the immune and stromal ESTIMATE signatures) and choose a particular single sample method:

hack_sig(test_expr, signatures = "estimate", method = "zscore")
#> # A tibble: 20 × 3
#>   sample_id estimate_immune estimate_stromal
#>   <chr>               <dbl>            <dbl>
#> 1 sample1            -2.65            -0.262
#> 2 sample2             1.19            -0.717
#> 3 sample3            -0.455            0.254
#> 4 sample4            -0.722            2.19 
#> 5 sample5            -1.07             0.112
#> # … with 15 more rows

Valid choices for single sample methods are:

"zscore", for the combined z-score;
"ssgsea", for the single sample GSEA;
"singscore", for the singscore method.

Run ?hack_sig to see additional parameter specifications for these methods.

As in check_sig(), the argument signatures can also be a list of gene signatures. For example, we can compute normalized single sample GSEA scores for the Hallmark gene sets:

hack_sig(test_expr, hallmark_list, 
         method = "ssgsea", sample_norm = "separate", alpha = 0.5)
#> # A tibble: 20 × 51
#>   sample_id HALLMARK_ADIPOGE… HALLMARK_ALLOGR… HALLMARK_ANDROG… HALLMARK_ANGIOG…
#>   <chr>                 <dbl>            <dbl>            <dbl>            <dbl>
#> 1 sample1               0.683            0.419            0.943           0.447 
#> 2 sample2               0.501            0.554            0.643           0.612 
#> 3 sample3               0.387            0.368            0.695           0.409 
#> 4 sample4               0.710            0.524            0.793           0.502 
#> 5 sample5               0.339            0.422            0.917           0.0500
#> # … with 15 more rows, and 46 more variables: HALLMARK_APICAL_JUNCTION <dbl>,
#> #   HALLMARK_APICAL_SURFACE <dbl>, HALLMARK_APOPTOSIS <dbl>,
#> #   HALLMARK_BILE_ACID_METABOLISM <dbl>,
#> #   HALLMARK_CHOLESTEROL_HOMEOSTASIS <dbl>, HALLMARK_COAGULATION <dbl>,
#> #   HALLMARK_COMPLEMENT <dbl>, HALLMARK_DNA_REPAIR <dbl>,
#> #   HALLMARK_E2F_TARGETS <dbl>,
#> #   HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION <dbl>, …

There are three methods for which hack_sig() cannot be used to compute gene signature scores with the original method. These are: CINSARC, ESTIMATE and the Immunophenoscore.

hack_cinsarc

For the CINSARC classification, you must provide a vector with distant metastasis status:

set.seed(123)
rand_dm <- sample(c(0, 1), size = ncol(test_expr), replace = TRUE)
hack_cinsarc(test_expr, rand_dm)
#> # A tibble: 20 × 2
#>   sample_id cinsarc_class
#>   <chr>     <chr>        
#> 1 sample1   C2           
#> 2 sample2   C1           
#> 3 sample3   C2           
#> 4 sample4   C1           
#> 5 sample5   C2           
#> # … with 15 more rows

hack_estimate

Immune, stromal, ESTIMATE and tumor purity scores from the ESTIMATE method can be obtained with:

hack_estimate(test_expr)
#> # A tibble: 20 × 5
#>   sample_id immune_score stroma_score estimate_score purity_score
#>   <chr>            <dbl>        <dbl>          <dbl>        <dbl>
#> 1 sample1          -636.         778.           142.        0.811
#> 2 sample2          2118.         703.          2821.        0.524
#> 3 sample3           725.         805.          1530.        0.675
#> 4 sample4           737.        2031.          2768.        0.531
#> 5 sample5           181.        1129.          1310.        0.699
#> # … with 15 more rows

hack_immunophenoscore

Finally, the raw immunophenoscore and its discrete (0-10 normalized) counterpart can be obtained with:

hack_immunophenoscore(test_expr)
#> # A tibble: 20 × 3
#>   sample_id raw_score ips_score
#>   <chr>         <dbl>     <dbl>
#> 1 sample1      0.942          3
#> 2 sample2     -0.348          0
#> 3 sample3      0.0939         0
#> 4 sample4     -0.335          0
#> 5 sample5      1.64           5
#> # … with 15 more rows

You can also obtain all biomarker scores with:

hack_immunophenoscore(test_expr, extract = "all")
#> # A tibble: 20 × 19
#>   sample_id raw_score ips_score ec_score mhc_score sc_score cp_score
#>   <chr>         <dbl>     <dbl>    <dbl>     <dbl>    <dbl>    <dbl>
#> 1 sample1      0.942          3  -0.148      0.312   0.652    0.126 
#> 2 sample2     -0.348          0   0.0679    -0.796  -0.0268   0.407 
#> 3 sample3      0.0939         0   0.0347    -0.483  -0.0843   0.627 
#> 4 sample4     -0.335          0   0.292     -0.777   0.166   -0.0161
#> 5 sample5      1.64           5   0.0607     0.876  -0.0742   0.778 
#> # … with 15 more rows, and 12 more variables: act_cd8_score <dbl>,
#> #   tem_cd8_score <dbl>, tem_cd4_score <dbl>, b2m_score <dbl>,
#> #   act_cd4_score <dbl>, treg_score <dbl>, mdsc_score <dbl>, cd27_score <dbl>,
#> #   icos_score <dbl>, pd1_score <dbl>, pdl2_score <dbl>, tigit_score <dbl>

Stratify your samples

If you want to categorize your samples into two or more signature classes based on a score cutoff, you can use hack_class() after hack_sig():

test_expr %>% 
    hack_sig("estimate", method = "singscore", direction = "up") %>% 
    hack_class()
#> # A tibble: 20 × 3
#>   sample_id estimate_immune estimate_stromal
#>   <chr>     <chr>           <chr>           
#> 1 sample1   low             low             
#> 2 sample2   high            low             
#> 3 sample3   low             low             
#> 4 sample4   low             high            
#> 5 sample5   low             high            
#> # … with 15 more rows

By default, hack_class() will stratify samples either with the original publication method (if any) or by the median score (otherwise). hack_class() will work only with signatures implemented in hacksig.

Speed-up computation time

Our rank-based single sample method implementations (i.e. single sample GSEA and singscore) are slower than their counterparts implemented in GSVA and singscore. Hence, to speed-up computation time you can use the future package:

plan(multisession)
hack_sig(test_expr, hallmark_list, method = "ssgsea")
#> # A tibble: 20 × 51
#>   sample_id HALLMARK_ADIPOGE… HALLMARK_ALLOGR… HALLMARK_ANDROG… HALLMARK_ANGIOG…
#>   <chr>                 <dbl>            <dbl>            <dbl>            <dbl>
#> 1 sample1               1593.             709.            2013.            1396.
#> 2 sample2               1279.            1052.            1067.            1672.
#> 3 sample3               1061.             661.            1151.            1145.
#> 4 sample4               1613.             954.            1677.            1586.
#> 5 sample5                818.             804.            1866.             622.
#> # … with 15 more rows, and 46 more variables: HALLMARK_APICAL_JUNCTION <dbl>,
#> #   HALLMARK_APICAL_SURFACE <dbl>, HALLMARK_APOPTOSIS <dbl>,
#> #   HALLMARK_BILE_ACID_METABOLISM <dbl>,
#> #   HALLMARK_CHOLESTEROL_HOMEOSTASIS <dbl>, HALLMARK_COAGULATION <dbl>,
#> #   HALLMARK_COMPLEMENT <dbl>, HALLMARK_DNA_REPAIR <dbl>,
#> #   HALLMARK_E2F_TARGETS <dbl>,
#> #   HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION <dbl>, …

Use case

Let’s say we want to compute single sample scores for the KEGG gene set collection and then correlate these scores with the tumor purity given by the ESTIMATE method.

First, we get the KEGG list and use check_sig() to keep only those gene sets whose genes are more than 2/3 present in our gene expression matrix.

kegg_list <- msigdbr(species = "Homo sapiens", subcategory = "KEGG") %>%
    distinct(gs_name, gene_symbol) %>%
    nest(genes = c(gene_symbol)) %>%
    mutate(genes = map(genes, compose(as_vector, unname))) %>%
    deframe()

kegg_ok <- check_sig(test_expr, kegg_list) %>% 
    filter(frac_present > 0.66) %>% 
    pull(signature_id)

Then, we apply both the combined z-score and the ssGSEA method for the resulting list of 10 KEGG gene sets using purrr::map_dfr():

kegg_scores <- map_dfr(list(zscore = "zscore", ssgsea = "ssgsea"), 
                       ~ hack_sig(test_expr,
                                  kegg_list[kegg_ok],
                                  method = .x,
                                  sample_norm = "separate"),
                       .id = "method")

We can transform the kegg_scores tibble in long format using tidyr::pivot_longer():

kegg_scores <- kegg_scores %>% 
    pivot_longer(starts_with("KEGG"), 
                 names_to = "kegg_id", values_to = "kegg_score")

Finally, after computing the tumor purity scores, we can merge the two data sets and plot the results:

purity_scores <- hack_estimate(test_expr) %>% select(sample_id, purity_score)

kegg_scores %>% 
    left_join(purity_scores, by = "sample_id") %>% 
    ggplot(aes(x = kegg_id, y = kegg_score)) +
    geom_boxplot(outlier.alpha = 0) +
    geom_jitter(aes(color = purity_score), alpha = 0.8, width = 0.1) +
    facet_wrap(facets = vars(method), scales = "free_x") +
    coord_flip() +
    scale_color_viridis_c() +
    labs(x = NULL, y = "enrichment score", color = "Tumor purity") +
    theme(legend.position = "top")