Author: Ville-Petteri Mäkinen
Version: 1.0.7

Abstract: Allspice is a lightweight machine learning tool that was initially developed for RNA profiling of acute lymphoplastic leukemia, but it can be used for any problem where multiple classes need to be identified from multi-dimensional data. The classifier contains optimized mean profiles for the classes (centroids) as observed in the training data, and new samples are matched to these centroids using the shortest Euclidean distance. Allspice produces both numerical and visual output of the classification results and flags samples with mixed features from multiple classes or atypical values.

Citation: Ville-Petteri Mäkinen, Jacqueline Rehn, James Breen, David Yeung, Deborah L White (2022) Multi-cohort transcriptomic subtyping of B-cell acute lymphoblastic leukemia, Int J Mol Sci 2022, 23(9), 4574, https://doi.org/10.3390/ijms23094574

Resources: Additional assets were created from tissue-specific RNA-seq data from the Genotype-Tissue Expression Project v8 [Lonsdale J et al. Nat Genet 45, 580–585 (2013) https://doi.org/10.1038/ng.2653], from lymphoblastoid cell lines [Lappalainen et al. Nature 2013 September 26; 501(7468): 506–511. doi:10.1038/nature12531], from pediatric T-cell leukemia samples [Drobna-Śledzińska et al. ArrayExpress E-MTAB-11759, accessed 2022], and from 11 EPT and 134 T-cell ALL samples (South Australian Health and Medical Research Institute, 2022, unpublished).

image/svg+xml UnadjustedRNA data AdjustedRNA data UnadjustedRNA data AdjustedRNA data mRNA-seq Subtypes Drivers Tissues

1 Getting started

You can install Allspice using the standard procedure:

# Install the package from a remote repository.
install.packages("Allspice")

To activate and list the library functions, type

# Activate the library.
library("Allspice")
packageVersion("Allspice")
## [1] '1.0.7'
ls("package:Allspice")
##  [1] "assemble<-"      "asset"           "bcellALL"        "classifier"     
##  [5] "classify"        "configuration"   "configuration<-" "covariates<-"   
##  [9] "export"          "information"     "nomenclature<-"  "normalize"      
## [13] "predictions"     "profiles<-"      "report"          "scores"         
## [17] "standardize"     "visuals<-"

Each Allspice function comes with a help page that contains a code example, such as

# Access function documentation (not shown in vignette).
? Allspice::classifier

You can run all the code examples in one go by typing

# Run all code examples (not shown in vignette).
fn <- system.file("examples.R", package = "Allspice")
source(fn)

2 Dataset

To run this tutorial, we will use simulated data that mimics B-cell acute lymphoblast leukemia samples:

# Generate gene RNA read counts.
simu <- bcellALL(300)
print(simu$counts[1:5,1:6])
##        patient1 patient2 patient3 patient4 patient5 patient6
## NFYA       3198     3214     2125     2137     3250     3676
## STPG1        47       48      299       55       29       98
## ENPP4       586       38       75      196     1863       55
## SEMA3F      527       87      198       69     1707      132
## LAP3        863     1838     3187     2192     1149     2312
print(simu$metadata[1:6,])
##          MALE AGE    SUBTYPE
## patient1    0  12 ETV6.RUNX1
## patient2    0  27         Ph
## patient3    1  27      KMT2A
## patient4    1  34  PAX5.p80r
## patient5    1  29 ETV6.RUNX1
## patient6    1  50     ZNF384

The output contains the unadjusted read counts as would be produced by a standard RNA-seq pipeline and patients’ sex and age are also simulated. Please note that these data are artificial and meant only for demonstration purposes; they do not reflect the complex biology of ALL accurately.

The read counts must be stored in a matrix that has the samples organized into columns and the genes are listed as rows. For the meta-data, patients can be listed in rows as that is more convenient in most situations.

3 Classify samples

3.1 Predict labels

The first step is to create a new classifier:

# Set up a classifier for genetic B-cell ALL subtypes.
cls <- classifier()
## 
## ALL subtype
##     configuration -> 164 bytes
##     categories -> 984 bytes
##     reference -> 335,109 bytes
##     covariates -> 103 bytes
##     nomenclature -> 325,618 bytes
##     centroids -> 15,212 bytes
##     coefficients -> 3,150 bytes
## 
## Driver genes
##     configuration -> 165 bytes
##     categories -> 1,461 bytes
##     reference -> 333,812 bytes
##     covariates -> 102 bytes
##     nomenclature -> 324,271 bytes
##     centroids -> 22,764 bytes
##     coefficients -> 4,796 bytes
## 
## Source tissue
##     configuration -> 166 bytes
##     categories -> 2,619 bytes
##     reference -> 251,540 bytes
##     covariates -> 105 bytes
##     nomenclature -> 242,601 bytes
##     centroids -> 42,622 bytes
##     coefficients -> 9,144 bytes

By default, the command creates an object that contains three models: i) ALL subtype prediction, ii) driver gene prediction and iii) tissue profiling to make sure the RNA data are suitable for analyses.

The next step is to import the covariates (e.g. sex and age) into the classifier. If there are no covariate data available, this step can be skipped. The default classifier uses MALE and AGE as column headings, see the format of the metadata in the first section of the vignette. The names of covariates are also available as

# List covariates.
info <- information(cls)
print(info$covariates[,c("ASSET","TITLE","COVAR")])
##   ASSET         TITLE COVAR
## 1     1   ALL subtype  MALE
## 2     1   ALL subtype   AGE
## 3     2  Driver genes  MALE
## 4     2  Driver genes   AGE
## 5     3 Source tissue  MALE
## 6     3 Source tissue   AGE

To set the covariates, use the command

# Set covariates.
covariates(cls) <- simu$metadata

Missing values or incomplete data are allowed, however, a warning will be produced when the subtypes are predicted. Also, it is important to set the covariates before loading the RNA read counts.

Once the covariates have been set, the RNA-seq read counts are next:

# Load RNA-seq profiles.
profiles(cls) <- simu$counts

The above command pre-processes the data and predicts the subtypes based on the combined information from covariates and read counts. To get the results, use the command

# Prediction results.
pred <- predictions(cls)
primary <- pred[[1]]
print(primary[1:6,c("LABEL","FREQ","PROX","EXCL")])
##                   LABEL  FREQ  PROX  EXCL
## patient1    ETV6::RUNX1 0.977 0.997 0.998
## patient2 Ph (BCR::ABL1) 0.788 0.951 0.824
## patient3          KMT2A 0.996 1.000 0.997
## patient4      PAX5 P80R 0.841 0.984 0.965
## patient5    ETV6::RUNX1 0.993 0.999 1.000
## patient6         ZNF384 0.992 1.000 1.000

The output is a list of data frames, where each element contains the results generated by a classification asset (see the section on how to create your own classifier for more information on assets). Here, we are interested in the ALL subtype which is the primary asset in the classifier.

Please use the tabs at the top to navigate to the next page.

3.2 Metrics

Here are selected results from the previous page again:

# Prediction results.
ambig <- which(primary$CATEG == "Ambiguous")
uncla <- which(primary$CATEG == "Unclassified")
rows <- unique(c(1:5, ambig[1], uncla[1]))
print(primary[rows,c("LABEL","MATCH","FREQ","PROX","EXCL")])
##                    LABEL      MATCH     FREQ  PROX  EXCL
## patient1     ETV6::RUNX1 ETV6.RUNX1 9.77e-01 0.997 0.998
## patient2  Ph (BCR::ABL1)         Ph 7.88e-01 0.951 0.824
## patient3           KMT2A      KMT2A 9.96e-01 1.000 0.997
## patient4       PAX5 P80R  PAX5.p80r 8.41e-01 0.984 0.965
## patient5     ETV6::RUNX1 ETV6.RUNX1 9.93e-01 0.999 1.000
## patient17      Ambiguous   PAX5.alt 6.20e-01 0.940 0.463
## patient41   Unclassified        HLF 2.27e-05 0.117 0.220

The data frame includes additional information besides the most plausible ALL subtype LABEL. The ALL subtype that most resembles the patient’s profile is defined as the best-matching subtype MATCH. For most cases, LABEL will refer to the same subtype as MATCH, but failure to pass quality checks and the rarity of a subtype may sometimes lead to conflicting results.

The next column named FREQ tells the frequency of the predicted subtype given the observed RNA profile and covariates. For example, if FREQ is 0.9, then you would expect to see 9 out of 10 patients that share this data profile to have that predicted ALL subtype. This metric depends on the training set that was used when the classification asset was created. Thus rare subtypes are likely to get lower frequency estimates even if the RNA profile would be a close match.

The frequency metric does not capture mutual exclusivity because it is evaluated independently for each subtype. For example, Allspice allows patients to have more than one subtype in parallel when the statistical model is trained. In this scenario, parallel subtypes may all get high frequencies, of which the highest would be chosen as the predicted subtype. To identify samples with mixed transcriptional characteristics, the results also includes a probabilistic estimate for exclusivity in the column EXCL that tells how likely it is that a sample comes from a single ALL subtype.

PROX (proximity) and EXCL (exclusivity) are used as quality metrics. By default, Allspice labels any samples with <50% proximity as ‘Unclassified’. Of those that pass the probability threshold, samples with <50% exclusivity are labelled as ‘Ambiguous’. The proximity measure is conceptually the same as subtype frequency, however, any differences in subtype prevalence are adjusted away.

You may have noticed that the subtype labels and the matched subtype names are slightly different. This is because text labels with non-alphanumeric characters may be automatically changed by R when used as row or column names in data structures, thus the subtype identities are made safe compared to the human readable labels. You can access subtype labels by typing:

# Access subtype labelling information.
info <- information(cls)
print(info$categories[1:5,c("ASSET","TITLE","CATEG","LABEL")])
##   ASSET       TITLE      CATEG       LABEL
## 1     1 ALL subtype   BCL2_MYC    BCL2/MYC
## 2     1 ALL subtype  CDX2.high CDX2 hi-exp
## 3     1 ALL subtype      CRLF2       CRLF2
## 4     1 ALL subtype       DUX4        DUX4
## 5     1 ALL subtype ETV6.RUNX1 ETV6::RUNX1

See also the Iris data example in the section on creating new classifiers. It contains instructions on how to set the labels for a classification asset.

Please use the tabs at the top to navigate to the next page.

3.3 Visual report

Detailed classification results can be visualised on a case-by-case basis. In the first example, we first identify a patient sample that passed quality control and did not show mixed characteristics:

# Select successful classification.
rows <- which((primary$CATEG != "Ambiguous") & (primary$FREQ > 0.9))
print(primary[rows[1],c("LABEL","MATCH","FREQ","PROX","EXCL")])
##                LABEL      MATCH  FREQ  PROX  EXCL
## patient1 ETV6::RUNX1 ETV6.RUNX1 0.977 0.997 0.998
# Show patient report.
report(cls, name = rows[1], file = NULL)
Figure: Classification results. The predicted subtype label is written on the top left corner. The expected frequency of the subtype given the data profile is written as a percentage after the predicted label. RNA biomarker scores were also calculated for the presence of alterations in specific genes (center bar chart), note that a patient may have multiple driver genes in parallel. The right-most chart shows RNA biomarker scores for the predicted source tissue of the sample (in this simulated dataset, all samples were generated from B-cell ALL profiles).

Figure: Classification results. The predicted subtype label is written on the top left corner. The expected frequency of the subtype given the data profile is written as a percentage after the predicted label. RNA biomarker scores were also calculated for the presence of alterations in specific genes (center bar chart), note that a patient may have multiple driver genes in parallel. The right-most chart shows RNA biomarker scores for the predicted source tissue of the sample (in this simulated dataset, all samples were generated from B-cell ALL profiles).

The second example shows a case where the subtype is less obvious due to mixed transcriptional characteristics. Sometimes this can be caused by random incidental fluctuations that are always present in biological data, but some of these cases will be “real” in the sense that the same individual can harbour multiple cancer driving mutations or some alterations affect wide regions of the genome (e.g. deletions or duplications of chromosomal segments) and thus multiple genes would be involved.

# Select ambiguous classification.
rows <- which((primary$CATEG == "Ambiguous") & (primary$PROX > 0.9))
print(primary[rows[1],c("LABEL","MATCH","FREQ","PROX","EXCL")])
##               LABEL    MATCH FREQ PROX  EXCL
## patient17 Ambiguous PAX5.alt 0.62 0.94 0.463

Mixed profiles typically manifest multiple indicators for driver genes while still being identified as a B-cell ALL tissue type:

# Show patient report.
report(cls, name = rows[1], file = NULL)
Figure: Classification results for a case with mixed transcriptional characteristics.

Figure: Classification results for a case with mixed transcriptional characteristics.

Lastly, the simulated dataset includes samples with randomized data:

# Select poor quality samples.
rows <- which(primary$CATEG == "Unclassified")
print(primary[rows[1],c("LABEL","MATCH","FREQ","PROX","EXCL")])
##                  LABEL MATCH     FREQ  PROX EXCL
## patient41 Unclassified   HLF 2.27e-05 0.117 0.22
# Show patient report.
report(cls, name = rows[1], file = NULL)
Figure: Classification results for a case with atypical data.

Figure: Classification results for a case with atypical data.

4 Train new classifier

4.1 Configuration

In Allspice, a classifier is actually a collection of classification models. These models are enclosed in classification assets (objects of the Asset class). As was observed in the previous sections, the default Allspice classifier contains three assets (ALL subtypes, driver genes and tissues). This section describes how to create a new RNA-seq classification asset from the simulated B-cell ALL dataset. Another example that uses the classisal Iris flower dataset is also presented as a demonstration on how to use Allspice for any type of dataset with category labels.

Before creating the assets, it is important to configure the models for different types of datasets. The default configuration is shown below:

# Create a new empty asset.
bALL <- asset()
print(configuration(bALL))
##            norm     nonzero.min   nonzero.ratio        standard       logarithm 
##            1.00          100.00            0.01            1.00            1.00 
##      ninput.max     rrinput.max   proximity.min exclusivity.min 
##           20.00            0.36            0.50            0.50

RNA-seq normalization parameters include norm (if set to 0, normalization is not performed), nonzero.min (the minimum read count considered non-neglible expression) and nonzero.ratio (minimum ratio of non-neglible values for a transcript to be included in the output). Allspice uses the DESeq2 method for the normalization of read counts (Genome Biol 15, 550, 2014).

Standardization parameters include standard (if set to 0, standardization is not performed) and logarithm (if set to 0, data values are used without taking the logarithm). The exact logarithmic transformation is log2(x + 1).

Feature selection parameters include ninput.max (the maximum number of features to be used for classification) and rrinput.max (the maximum correlation r-squared to be allowed between features).

Classification parameters include proximity.min (minimum similarity rating considered for reliable classification) and exclusivity.min (minimum exclusivity for non-ambiguous classification).

Only empty assets can be re-configured. To change the parameters, use a numeric vector with appropriately named elements:

# Re-configure asset.
configuration(bALL) <- c(ninput.max=30, nonzero.min=90)
print(configuration(bALL)[c("ninput.max","nonzero.min")])
##  ninput.max nonzero.min 
##          30          90

Please use the tabs at the top to navigate to the next page.

4.2 RNA-seq

To populate an empty asset, some data organization is required. The first thing to decide is the title of the asset, which will be displayed in the visual report (see figures in the previous section). The process begins with a list:

# Prepare asset title.
materials <- list(title="Simutypes")

Next, non-normalized RNA data are included. The dataset needs to be a numerical matrix with samples in columns:

# Prepare RNA-seq read counts.
materials$dat <- simu$counts

Covariates can be in a data frame or a matrix with samples on rows, just make sure all the columns are numeric:

# Prepare covariate data.
materials$covariates <- simu$metadata[,c("MALE","AGE")]

It is very important that all the rows and columns are named in each data structure since they will be used later to link the inputs together. It is good practice to always ensure that the row and column names are written with alphanumeric characters without white spaces, otherwise the automatic R filters may change some of the characters and cause unexpected errors.

The last remaining source of information are the pre-defined subtypes themselves that will be used to train the classification model. In this example, the subtypes are mutually exclusive (a patient can only exhibit a single subtype), therefore the subtypes can be obtained from the single column in the metadata:

# Prepare subtype information.
categ <- simu$metadata[,"SUBTYPE",drop=FALSE]
rows <- which(categ != "Contaminated")
materials$bits <- categ[rows,,drop=FALSE]

Note the drop flag in the code above; this ensures that the row names are carried forward to the bits element and can therefore be used to link the subtype information with the rest of the patient data. The dataset contains simulated cases where the category is undefined or the data are contaminated. These are excluded from training.

If necessary, Allspice also allows overlapping categories, but then the bits need to be supplied in a binary matrix where 1s indicate category membership (the name of a column is the name of the corresponding category). Samples should be organized into rows similarly to the covariates.

In the final step, the materials are assembled into the final classification asset:

# Assemble the classification asset.
bALL <- asset()
assemble(bALL) <- materials
## Warning: Asset.assemble(): 285 / 300 samples included.
## Warning: Asset.assemble(): 14 / 17 categories included.

Since we excluded the category labels of “bad” samples, the training set includes a subset of all samples. Categories that are too small (too few samples) are ignored and will produce a warning. It is good practice to prepare your dataset in such a way that all categories have at least five members.

# Save asset to disk.
tpath <- tempfile()
export(bALL, folder = tpath)
## [1] "/tmp/Rtmpj1ArJu/file41367a1f7c07/configuration.txt"
## [2] "/tmp/Rtmpj1ArJu/file41367a1f7c07/categories.txt"   
## [3] "/tmp/Rtmpj1ArJu/file41367a1f7c07/reference.txt"    
## [4] "/tmp/Rtmpj1ArJu/file41367a1f7c07/covariates.txt"   
## [5] "/tmp/Rtmpj1ArJu/file41367a1f7c07/centroids.txt"    
## [6] "/tmp/Rtmpj1ArJu/file41367a1f7c07/coefficients.txt"

The asset can now be loaded into a classifier:

# Create a classifier.
clstest <- classifier(tpath)
## 
## Simutypes
##     configuration -> 162 bytes
##     categories -> 917 bytes
##     reference -> 295,772 bytes
##     covariates -> 100 bytes
##     centroids -> 5,333 bytes
##     coefficients -> 2,453 bytes
# Classify samples.
simutest <- bcellALL(5)
covariates(clstest) <- simutest$metadata
profiles(clstest) <- simutest$counts
primtest <- predictions(clstest)[[1]]
print(primtest[,c("LABEL","MATCH","FREQ","PROX","EXCL")])
##               LABEL       MATCH  FREQ  PROX  EXCL
## patient1 ETV6.RUNX1  ETV6.RUNX1 0.997 0.998 0.999
## patient2  TCF3.PBX1   TCF3.PBX1 0.690 0.958 0.637
## patient3  Ambiguous Hypodiploid 0.432 0.880 0.356
## patient4         Ph          Ph 0.940 0.975 0.949
## patient5      KMT2A       KMT2A 0.971 0.991 0.997
# Show correct subtypes.
print(simutest$metadata)
##          MALE AGE     SUBTYPE
## patient1    1   6  ETV6.RUNX1
## patient2    1  17   TCF3.PBX1
## patient3    1  50 Hypodiploid
## patient4    0  41          Ph
## patient5    0  49       KMT2A

Please use the tabs at the top to navigate to the next page.

4.3 Iris data

Base R containts the classic Iris dataset that is attributed to Edgar Anderson who collected it and Ronald Fisher who used it as an example of linear discriminant analysis:

# Iris flower dataset.
print(head(iris))
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

To create a classification asset, it is good to make sure that the data are properly identified:

# Set row names.
flowers <- iris
rownames(flowers) <- paste0("flower", 1:nrow(flowers))
print(flowers[c(1,80,150),])
##           Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## flower1            5.1         3.5          1.4         0.2     setosa
## flower80           5.7         2.6          3.5         1.0 versicolor
## flower150          5.9         3.0          5.1         1.8  virginica

The training dataset can now be finalized (note the format of the data matrix):

# Prepare training set.
materials <- list(title="Iris species")
materials$dat <- t(flowers[,1:4]) # vars on rows, samples on columns
materials$bits <- flowers[,"Species",drop=FALSE]

The original species labels are short, however, it is nice to have the full species names in the results as well. Adding labels can be accomplished by

# Set human-readable category labels.
model <- asset()
labels <- c("Iris Setosa", "Iris Virginica", "Iris Versicolor")
names(labels) <- c("setosa", "virginica", "versicolor")
visuals(model) <- labels

Next, the RNA-seq normalization and logarithm needs to be disabled since this is not RNA-seq data:

# Configure a new asset.
configuration(model) <- c(norm=FALSE, logarithm=FALSE)
configuration(model) <- c(nonzero.min=0, nonzero.ratio=0)
print(configuration(model))
##            norm     nonzero.min   nonzero.ratio        standard       logarithm 
##            0.00            0.00            0.00            1.00            0.00 
##      ninput.max     rrinput.max   proximity.min exclusivity.min 
##           20.00            0.36            0.50            0.50

The asset can now be finished

# Assemble the classification asset.
assemble(model) <- materials
tpath <- tempfile()
export(model, folder = tpath)
## [1] "/tmp/Rtmpj1ArJu/file4136302d7d90/configuration.txt"
## [2] "/tmp/Rtmpj1ArJu/file4136302d7d90/categories.txt"   
## [3] "/tmp/Rtmpj1ArJu/file4136302d7d90/reference.txt"    
## [4] "/tmp/Rtmpj1ArJu/file4136302d7d90/centroids.txt"    
## [5] "/tmp/Rtmpj1ArJu/file4136302d7d90/coefficients.txt"

and tested:

# Classify samples.
clsiris <- classifier(tpath)
## 
## Iris species
##     configuration -> 160 bytes
##     categories -> 296 bytes
##     reference -> 264 bytes
##     centroids -> 298 bytes
##     coefficients -> 313 bytes
profiles(clsiris) <- t(flowers[,1:4])
iristest <- predictions(clsiris)[[1]]
print(iristest[c(1,80,150),c("LABEL","MATCH","PROX","EXCL")])
##                     LABEL      MATCH  PROX  EXCL
## flower1       Iris Setosa     setosa 1.000 1.000
## flower80  Iris Versicolor versicolor 0.887 0.931
## flower150       Ambiguous versicolor 0.827 0.044

We know that the Setosa species is easiest to distinguish of the three (see e.g. the Wikipedia page on the Iris study) and this is manifested as an increase in ambiguous results for the two other categories:

# Summary of results.
print(table(iristest$LABEL, flowers$Species))
##                  
##                   setosa versicolor virginica
##   Ambiguous            0         11        16
##   Iris Setosa         49          0         0
##   Iris Versicolor      0         33         1
##   Iris Virginica       0          3        29
##   Unclassified         1          3         4

Please use the tabs at the top to navigate to the next page.

4.4 Supercategories

On the previous tab, we changed the labels of the categories. It is also possible to set duplicated labels, indeed, the default classifier contains a tissue detection asset where the B-cell ALL covers multiple ALL subtype centroids. Let’s revisit the ALL classifier:

# Default ALL classifier.
cls <- classifier()
# Predict source tissue.
simu <- bcellALL(5)
covariates(cls) <- simu$metadata
profiles(cls) <- simu$counts
tissues <- predictions(cls)[[3]]
print(tissues[,c("LABEL","CATEG","MATCH","MATCH.2nd")])
##               LABEL        CATEG        MATCH      MATCH.2nd
## patient1 B-cell ALL   ETV6.RUNX1   ETV6.RUNX1 GTEx.Fallopian
## patient2 B-cell ALL    PAX5.p80r    PAX5.p80r        ETP.ALL
## patient3 B-cell ALL  Hypodiploid  Hypodiploid    Lymphoblast
## patient4 B-cell ALL Hyperdiploid Hyperdiploid GTEx.Fallopian
## patient5 B-cell ALL        CRLF2        CRLF2        ETP.ALL

Notice how the predicted categories are different ALL subtypes, but the label for every one of them is “B-cell ALL”. In the visual report, only the supercategory is displayed (see the figures in the section on sample classification). The output also shows the second best matching category, which in this case is the T-cell ALL RNA centroid. This is expected, since it is another similar form of leukemia.

Why would you use labels to define these supercategories? The answer is a technical one. You could put all the ALL samples together and calculate a single B-cell ALL centroid, however, this model would be at risk of being too coarse considering the great diversity of RNA profiles associated with specific subtypes. For example, the rare but distinctive subtype driven by the N159Y alteration in IKZF1 could be considered an outlier as it is far from the overall ALL centroid, and thus these samples could be incorrectly classified if the IKZF1 N159Y was not defined as an explicit subtype.

4.5 Asset contents

Classification assets are stored as text files on the disk and, although not recommended, it is possible to view and edit them directly. For instance, the files for ALL subtypes that were imported into the default classifier can be found via system files that are part of the R library:

# Show asset contents.
base <- system.file(package = "Allspice")
folder <- file.path(base, "subtypes")
print(dir(folder))
## [1] "categories.txt"    "centroids.txt"     "coefficients.txt" 
## [4] "configuration.txt" "covariates.txt"    "nomenclature.txt" 
## [7] "reference.txt"

Categories file contains information on the subtypes and their visual attributes:

# Category information.
dat <- read.delim(file.path(folder, "categories.txt"))
print(dat)
##           CATEG          LABEL     COLOR COLOR.light COLOR.dark
## 1      BCL2_MYC       BCL2/MYC #FF47BDFF   #FF67DEFF  #E52092FF
## 2     CDX2.high    CDX2 hi-exp #FF458CFF   #FF65ABFF  #E61F66FF
## 3         CRLF2          CRLF2 #FF435BFF   #FF6278FF  #E71D39FF
## 4          DUX4           DUX4 #FF4531FF   #FF644BFF  #E81F13FF
## 5    ETV6.RUNX1    ETV6::RUNX1 #FF5A28FF   #FF7D3FFF  #E62F0FFF
## 6           HLF            HLF #FF701FFF   #FF9633FF  #E43F0AFF
## 7  Hyperdiploid   Hyperdiploid #FA8816FF   #FDAE27FF  #DC5306FF
## 8   Hypodiploid    Hypodiploid #ECA60DFF   #F8C71AFF  #C96F04FF
## 9        iAMP21         iAMP21 #DEC304FF   #F2E00EFF  #B58B01FF
## 10  IKZF1.n159y    IKZF1 N159Y #BDD708FF   #D7EF13FF  #93A003FF
## 11        KMT2A          KMT2A #8AE218FF   #A7F529FF  #62AE09FF
## 12        MEF2D          MEF2D #57EC28FF   #76FA3FFF  #31BC0FFF
## 13        NUTM1          NUTM1 #38E455FF   #53F56EFF  #19B536FF
## 14     PAX5.alt      PAX5 alt. #23D390FF   #37EBA9FF  #0EA26FFF
## 15    PAX5.p80r      PAX5 P80R #0EC1CBFF   #1BE1E5FF  #0490A7FF
## 16           Ph Ph (BCR::ABL1) #11ACE1FF   #20D0F3FF  #057CC2FF
## 17    TCF3.PBX1     TCF3::PBX1 #1A97F0FF   #2DBDF9FF  #0867D6FF
## 18       ZNF384         ZNF384 #2382FFFF   #39AAFFFF  #0C52EBFF
## 19 Unclassified   Unclassified   #888888     #b0b0b0    #404040
## 20    Ambiguous      Ambiguous   #888888     #b0b0b0    #404040

Centroids file contains subtype profiles for pre-processed data values:

# Standardized subtype profiles.
dat <- read.delim(file.path(folder, "centroids.txt"))
print(dat[1:5,1:6])
##     VAR BCL2_MYC CDX2.high  CRLF2   DUX4 ETV6.RUNX1
## 1  ECM1   -1.910    -1.075  0.729 -1.645    -0.0575
## 2 BLACE   -1.357     1.211  0.667 -1.230     0.9578
## 3 TLCD5   -0.366     0.771 -0.969 -0.149     1.0247
## 4 LAMP5    0.465     1.738 -0.573 -0.633    -0.3943
## 5  MSR1   -0.167    -0.423  0.647 -0.541     1.1322
cat(nrow(dat), " genes, ", ncol(dat), " subtypes\n", sep="")
## 45 genes, 19 subtypes

Coefficients file contains the parameters for the back-end regression analysis that combines the RNA biomarker from the centroid model with the covariates. B0 is the intercept and B1 is the coefficient for the RNA biomarker. Note that the coefficients depend on the scale of the inputs (covariates are never standardized):

# Regression coefficients.
dat <- read.delim(file.path(folder, "coefficients.txt"))
print(dat[1:5,])
##        CATEG MODEL   B0     B1     MALE      AGE
## 1   BCL2_MYC  prox 3.06  -3.60  0.77379  0.01618
## 2  CDX2.high  prox 3.49  -3.27 -0.22648  0.01048
## 3      CRLF2  prox 5.26  -5.81  0.05526 -0.00232
## 4       DUX4  prox 9.80 -10.06  0.55972  0.01437
## 5 ETV6.RUNX1  prox 9.61 -10.21  0.00285  0.00106

Configuration file contains the settings to determine how data should be processed:

# Asset settings.
dat <- read.delim(file.path(folder, "configuration.txt"))
print(dat)
##              PARAM       VALUE
## 1            title ALL subtype
## 2             norm           1
## 3      nonzero.min         100
## 4    nonzero.ratio        0.01
## 5         standard           1
## 6        logarithm           1
## 7       ninput.max          45
## 8      rrinput.max        0.36
## 9    proximity.min         0.5
## 10 exclusivity.min         0.5

Covariates file contains the basic statistics of the covariates. The MEAN is used as replacements for missing values:

# Covariate statistics.
dat <- read.delim(file.path(folder, "covariates.txt"))
print(dat)
##    VAR    N   MEAN     SD
## 1 MALE 1598  0.566  0.496
## 2  AGE 1598 23.886 21.831

Nomenclature file contains a dictionary for converting gene names from one system to another. In this example, the asset uses Hugo symbols but can translate ENSEMBL gene names as well:

# Gene names.
dat <- read.delim(file.path(folder, "nomenclature.txt"))
print(dat[1:5,])
##      VAR         ENSEMBL                                               NAME
## 1   NFYA ENSG00000001167       nuclear transcription factor Y subunit alpha
## 2  STPG1 ENSG00000001460             sperm tail PG-rich repeat containing 1
## 3  ENPP4 ENSG00000001561 ectonucleotide pyrophosphatase/phosphodiesterase 4
## 4 SEMA3F ENSG00000001617                                      semaphorin 3F
## 5   LAP3 ENSG00000002549                           leucine aminopeptidase 3

Reference file contains basic statistics of the RNA-seq training data. Only genes that are usable under the asset settings are included. The VALUE column is the median of original values, whereas the mean and standard deviation are calculated from the logarithms if the asset is so configured:

# RNA reference profile.
dat <- read.delim(file.path(folder, "reference.txt"))
print(dat[1:5,])
##      VAR  VALUE  MEAN    SD
## 1   NFYA 3008.9 11.55 0.631
## 2  STPG1   85.4  6.45 1.226
## 3  ENPP4  129.6  7.04 2.060
## 4 SEMA3F  250.6  7.99 2.475
## 5   LAP3 1985.9 10.95 0.715
cat(nrow(dat), " genes\n", sep="")
## 5808 genes

5 Build information

## R version 4.2.2 Patched (2022-11-10 r83330)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.5 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
## 
## locale:
##  [1] LC_CTYPE=en_AU.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_AU.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_AU.UTF-8    LC_MESSAGES=en_AU.UTF-8   
##  [7] LC_PAPER=en_AU.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] Allspice_1.0.7
## 
## loaded via a namespace (and not attached):
##  [1] digest_0.6.30   R6_2.5.1        lifecycle_1.0.3 jsonlite_1.8.4 
##  [5] magrittr_2.0.3  evaluate_0.18   highr_0.9       stringi_1.7.8  
##  [9] rlang_1.0.6     cachem_1.0.6    cli_3.4.1       jquerylib_0.1.4
## [13] bslib_0.4.1     vctrs_0.5.1     rmarkdown_2.18  tools_4.2.2    
## [17] stringr_1.5.0   glue_1.6.2      xfun_0.35       yaml_2.3.6     
## [21] fastmap_1.1.0   compiler_4.2.2  htmltools_0.5.4 knitr_1.41     
## [25] sass_0.4.4
## [1] "2023-01-20 13:40:56 ACDT"