Author: Ville-Petteri Mäkinen
Version: 1.0.7
Abstract: Allspice is a lightweight machine learning tool that was initially developed for RNA profiling of acute lymphoplastic leukemia, but it can be used for any problem where multiple classes need to be identified from multi-dimensional data. The classifier contains optimized mean profiles for the classes (centroids) as observed in the training data, and new samples are matched to these centroids using the shortest Euclidean distance. Allspice produces both numerical and visual output of the classification results and flags samples with mixed features from multiple classes or atypical values.
Citation: Ville-Petteri Mäkinen, Jacqueline Rehn, James Breen, David Yeung, Deborah L White (2022) Multi-cohort transcriptomic subtyping of B-cell acute lymphoblastic leukemia, Int J Mol Sci 2022, 23(9), 4574, https://doi.org/10.3390/ijms23094574
Resources: Additional assets were created from tissue-specific RNA-seq data from the Genotype-Tissue Expression Project v8 [Lonsdale J et al. Nat Genet 45, 580–585 (2013) https://doi.org/10.1038/ng.2653], from lymphoblastoid cell lines [Lappalainen et al. Nature 2013 September 26; 501(7468): 506–511. doi:10.1038/nature12531], from pediatric T-cell leukemia samples [Drobna-Śledzińska et al. ArrayExpress E-MTAB-11759, accessed 2022], and from 11 EPT and 134 T-cell ALL samples (South Australian Health and Medical Research Institute, 2022, unpublished).
You can install Allspice using the standard procedure:
# Install the package from a remote repository.
install.packages("Allspice")
To activate and list the library functions, type
# Activate the library.
library("Allspice")
packageVersion("Allspice")
## [1] '1.0.7'
ls("package:Allspice")
## [1] "assemble<-" "asset" "bcellALL" "classifier"
## [5] "classify" "configuration" "configuration<-" "covariates<-"
## [9] "export" "information" "nomenclature<-" "normalize"
## [13] "predictions" "profiles<-" "report" "scores"
## [17] "standardize" "visuals<-"
Each Allspice function comes with a help page that contains a code example, such as
# Access function documentation (not shown in vignette).
? Allspice::classifier
You can run all the code examples in one go by typing
# Run all code examples (not shown in vignette).
fn <- system.file("examples.R", package = "Allspice")
source(fn)
To run this tutorial, we will use simulated data that mimics B-cell acute lymphoblast leukemia samples:
# Generate gene RNA read counts.
simu <- bcellALL(300)
print(simu$counts[1:5,1:6])
## patient1 patient2 patient3 patient4 patient5 patient6
## NFYA 3198 3214 2125 2137 3250 3676
## STPG1 47 48 299 55 29 98
## ENPP4 586 38 75 196 1863 55
## SEMA3F 527 87 198 69 1707 132
## LAP3 863 1838 3187 2192 1149 2312
print(simu$metadata[1:6,])
## MALE AGE SUBTYPE
## patient1 0 12 ETV6.RUNX1
## patient2 0 27 Ph
## patient3 1 27 KMT2A
## patient4 1 34 PAX5.p80r
## patient5 1 29 ETV6.RUNX1
## patient6 1 50 ZNF384
The output contains the unadjusted read counts as would be produced by a standard RNA-seq pipeline and patients’ sex and age are also simulated. Please note that these data are artificial and meant only for demonstration purposes; they do not reflect the complex biology of ALL accurately.
The read counts must be stored in a matrix that has the samples organized into columns and the genes are listed as rows. For the meta-data, patients can be listed in rows as that is more convenient in most situations.
The first step is to create a new classifier:
# Set up a classifier for genetic B-cell ALL subtypes.
cls <- classifier()
##
## ALL subtype
## configuration -> 164 bytes
## categories -> 984 bytes
## reference -> 335,109 bytes
## covariates -> 103 bytes
## nomenclature -> 325,618 bytes
## centroids -> 15,212 bytes
## coefficients -> 3,150 bytes
##
## Driver genes
## configuration -> 165 bytes
## categories -> 1,461 bytes
## reference -> 333,812 bytes
## covariates -> 102 bytes
## nomenclature -> 324,271 bytes
## centroids -> 22,764 bytes
## coefficients -> 4,796 bytes
##
## Source tissue
## configuration -> 166 bytes
## categories -> 2,619 bytes
## reference -> 251,540 bytes
## covariates -> 105 bytes
## nomenclature -> 242,601 bytes
## centroids -> 42,622 bytes
## coefficients -> 9,144 bytes
By default, the command creates an object that contains three models: i) ALL subtype prediction, ii) driver gene prediction and iii) tissue profiling to make sure the RNA data are suitable for analyses.
The next step is to import the covariates (e.g. sex and age) into the classifier. If there are no covariate data available, this step can be skipped. The default classifier uses MALE and AGE as column headings, see the format of the metadata in the first section of the vignette. The names of covariates are also available as
# List covariates.
info <- information(cls)
print(info$covariates[,c("ASSET","TITLE","COVAR")])
## ASSET TITLE COVAR
## 1 1 ALL subtype MALE
## 2 1 ALL subtype AGE
## 3 2 Driver genes MALE
## 4 2 Driver genes AGE
## 5 3 Source tissue MALE
## 6 3 Source tissue AGE
To set the covariates, use the command
# Set covariates.
covariates(cls) <- simu$metadata
Missing values or incomplete data are allowed, however, a warning will be produced when the subtypes are predicted. Also, it is important to set the covariates before loading the RNA read counts.
Once the covariates have been set, the RNA-seq read counts are next:
# Load RNA-seq profiles.
profiles(cls) <- simu$counts
The above command pre-processes the data and predicts the subtypes based on the combined information from covariates and read counts. To get the results, use the command
# Prediction results.
pred <- predictions(cls)
primary <- pred[[1]]
print(primary[1:6,c("LABEL","FREQ","PROX","EXCL")])
## LABEL FREQ PROX EXCL
## patient1 ETV6::RUNX1 0.977 0.997 0.998
## patient2 Ph (BCR::ABL1) 0.788 0.951 0.824
## patient3 KMT2A 0.996 1.000 0.997
## patient4 PAX5 P80R 0.841 0.984 0.965
## patient5 ETV6::RUNX1 0.993 0.999 1.000
## patient6 ZNF384 0.992 1.000 1.000
The output is a list of data frames, where each element contains the results generated by a classification asset (see the section on how to create your own classifier for more information on assets). Here, we are interested in the ALL subtype which is the primary asset in the classifier.
Please use the tabs at the top to navigate to the next page.
Here are selected results from the previous page again:
# Prediction results.
ambig <- which(primary$CATEG == "Ambiguous")
uncla <- which(primary$CATEG == "Unclassified")
rows <- unique(c(1:5, ambig[1], uncla[1]))
print(primary[rows,c("LABEL","MATCH","FREQ","PROX","EXCL")])
## LABEL MATCH FREQ PROX EXCL
## patient1 ETV6::RUNX1 ETV6.RUNX1 9.77e-01 0.997 0.998
## patient2 Ph (BCR::ABL1) Ph 7.88e-01 0.951 0.824
## patient3 KMT2A KMT2A 9.96e-01 1.000 0.997
## patient4 PAX5 P80R PAX5.p80r 8.41e-01 0.984 0.965
## patient5 ETV6::RUNX1 ETV6.RUNX1 9.93e-01 0.999 1.000
## patient17 Ambiguous PAX5.alt 6.20e-01 0.940 0.463
## patient41 Unclassified HLF 2.27e-05 0.117 0.220
The data frame includes additional information besides the most plausible ALL subtype LABEL. The ALL subtype that most resembles the patient’s profile is defined as the best-matching subtype MATCH. For most cases, LABEL will refer to the same subtype as MATCH, but failure to pass quality checks and the rarity of a subtype may sometimes lead to conflicting results.
The next column named FREQ tells the frequency of the predicted subtype given the observed RNA profile and covariates. For example, if FREQ is 0.9, then you would expect to see 9 out of 10 patients that share this data profile to have that predicted ALL subtype. This metric depends on the training set that was used when the classification asset was created. Thus rare subtypes are likely to get lower frequency estimates even if the RNA profile would be a close match.
The frequency metric does not capture mutual exclusivity because it is evaluated independently for each subtype. For example, Allspice allows patients to have more than one subtype in parallel when the statistical model is trained. In this scenario, parallel subtypes may all get high frequencies, of which the highest would be chosen as the predicted subtype. To identify samples with mixed transcriptional characteristics, the results also includes a probabilistic estimate for exclusivity in the column EXCL that tells how likely it is that a sample comes from a single ALL subtype.
PROX (proximity) and EXCL (exclusivity) are used as quality metrics. By default, Allspice labels any samples with <50% proximity as ‘Unclassified’. Of those that pass the probability threshold, samples with <50% exclusivity are labelled as ‘Ambiguous’. The proximity measure is conceptually the same as subtype frequency, however, any differences in subtype prevalence are adjusted away.
You may have noticed that the subtype labels and the matched subtype names are slightly different. This is because text labels with non-alphanumeric characters may be automatically changed by R when used as row or column names in data structures, thus the subtype identities are made safe compared to the human readable labels. You can access subtype labels by typing:
# Access subtype labelling information.
info <- information(cls)
print(info$categories[1:5,c("ASSET","TITLE","CATEG","LABEL")])
## ASSET TITLE CATEG LABEL
## 1 1 ALL subtype BCL2_MYC BCL2/MYC
## 2 1 ALL subtype CDX2.high CDX2 hi-exp
## 3 1 ALL subtype CRLF2 CRLF2
## 4 1 ALL subtype DUX4 DUX4
## 5 1 ALL subtype ETV6.RUNX1 ETV6::RUNX1
See also the Iris data example in the section on creating new classifiers. It contains instructions on how to set the labels for a classification asset.
Please use the tabs at the top to navigate to the next page.
Detailed classification results can be visualised on a case-by-case basis. In the first example, we first identify a patient sample that passed quality control and did not show mixed characteristics:
# Select successful classification.
rows <- which((primary$CATEG != "Ambiguous") & (primary$FREQ > 0.9))
print(primary[rows[1],c("LABEL","MATCH","FREQ","PROX","EXCL")])
## LABEL MATCH FREQ PROX EXCL
## patient1 ETV6::RUNX1 ETV6.RUNX1 0.977 0.997 0.998
# Show patient report.
report(cls, name = rows[1], file = NULL)
The second example shows a case where the subtype is less obvious due to mixed transcriptional characteristics. Sometimes this can be caused by random incidental fluctuations that are always present in biological data, but some of these cases will be “real” in the sense that the same individual can harbour multiple cancer driving mutations or some alterations affect wide regions of the genome (e.g. deletions or duplications of chromosomal segments) and thus multiple genes would be involved.
# Select ambiguous classification.
rows <- which((primary$CATEG == "Ambiguous") & (primary$PROX > 0.9))
print(primary[rows[1],c("LABEL","MATCH","FREQ","PROX","EXCL")])
## LABEL MATCH FREQ PROX EXCL
## patient17 Ambiguous PAX5.alt 0.62 0.94 0.463
Mixed profiles typically manifest multiple indicators for driver genes while still being identified as a B-cell ALL tissue type:
# Show patient report.
report(cls, name = rows[1], file = NULL)
Lastly, the simulated dataset includes samples with randomized data:
# Select poor quality samples.
rows <- which(primary$CATEG == "Unclassified")
print(primary[rows[1],c("LABEL","MATCH","FREQ","PROX","EXCL")])
## LABEL MATCH FREQ PROX EXCL
## patient41 Unclassified HLF 2.27e-05 0.117 0.22
# Show patient report.
report(cls, name = rows[1], file = NULL)
In Allspice, a classifier is actually a collection of classification models. These models are enclosed in classification assets (objects of the Asset class). As was observed in the previous sections, the default Allspice classifier contains three assets (ALL subtypes, driver genes and tissues). This section describes how to create a new RNA-seq classification asset from the simulated B-cell ALL dataset. Another example that uses the classisal Iris flower dataset is also presented as a demonstration on how to use Allspice for any type of dataset with category labels.
Before creating the assets, it is important to configure the models for different types of datasets. The default configuration is shown below:
# Create a new empty asset.
bALL <- asset()
print(configuration(bALL))
## norm nonzero.min nonzero.ratio standard logarithm
## 1.00 100.00 0.01 1.00 1.00
## ninput.max rrinput.max proximity.min exclusivity.min
## 20.00 0.36 0.50 0.50
RNA-seq normalization parameters include norm (if set to 0, normalization is not performed), nonzero.min (the minimum read count considered non-neglible expression) and nonzero.ratio (minimum ratio of non-neglible values for a transcript to be included in the output). Allspice uses the DESeq2 method for the normalization of read counts (Genome Biol 15, 550, 2014).
Standardization parameters include standard (if set to 0, standardization is not performed) and logarithm (if set to 0, data values are used without taking the logarithm). The exact logarithmic transformation is log2(x + 1).
Feature selection parameters include ninput.max (the maximum number of features to be used for classification) and rrinput.max (the maximum correlation r-squared to be allowed between features).
Classification parameters include proximity.min (minimum similarity rating considered for reliable classification) and exclusivity.min (minimum exclusivity for non-ambiguous classification).
Only empty assets can be re-configured. To change the parameters, use a numeric vector with appropriately named elements:
# Re-configure asset.
configuration(bALL) <- c(ninput.max=30, nonzero.min=90)
print(configuration(bALL)[c("ninput.max","nonzero.min")])
## ninput.max nonzero.min
## 30 90
Please use the tabs at the top to navigate to the next page.
To populate an empty asset, some data organization is required. The first thing to decide is the title of the asset, which will be displayed in the visual report (see figures in the previous section). The process begins with a list:
# Prepare asset title.
materials <- list(title="Simutypes")
Next, non-normalized RNA data are included. The dataset needs to be a numerical matrix with samples in columns:
# Prepare RNA-seq read counts.
materials$dat <- simu$counts
Covariates can be in a data frame or a matrix with samples on rows, just make sure all the columns are numeric:
# Prepare covariate data.
materials$covariates <- simu$metadata[,c("MALE","AGE")]
It is very important that all the rows and columns are named in each data structure since they will be used later to link the inputs together. It is good practice to always ensure that the row and column names are written with alphanumeric characters without white spaces, otherwise the automatic R filters may change some of the characters and cause unexpected errors.
The last remaining source of information are the pre-defined subtypes themselves that will be used to train the classification model. In this example, the subtypes are mutually exclusive (a patient can only exhibit a single subtype), therefore the subtypes can be obtained from the single column in the metadata:
# Prepare subtype information.
categ <- simu$metadata[,"SUBTYPE",drop=FALSE]
rows <- which(categ != "Contaminated")
materials$bits <- categ[rows,,drop=FALSE]
Note the drop flag in the code above; this ensures that the row names are carried forward to the bits element and can therefore be used to link the subtype information with the rest of the patient data. The dataset contains simulated cases where the category is undefined or the data are contaminated. These are excluded from training.
If necessary, Allspice also allows overlapping categories, but then the bits need to be supplied in a binary matrix where 1s indicate category membership (the name of a column is the name of the corresponding category). Samples should be organized into rows similarly to the covariates.
In the final step, the materials are assembled into the final classification asset:
# Assemble the classification asset.
bALL <- asset()
assemble(bALL) <- materials
## Warning: Asset.assemble(): 285 / 300 samples included.
## Warning: Asset.assemble(): 14 / 17 categories included.
Since we excluded the category labels of “bad” samples, the training set includes a subset of all samples. Categories that are too small (too few samples) are ignored and will produce a warning. It is good practice to prepare your dataset in such a way that all categories have at least five members.
# Save asset to disk.
tpath <- tempfile()
export(bALL, folder = tpath)
## [1] "/tmp/Rtmpj1ArJu/file41367a1f7c07/configuration.txt"
## [2] "/tmp/Rtmpj1ArJu/file41367a1f7c07/categories.txt"
## [3] "/tmp/Rtmpj1ArJu/file41367a1f7c07/reference.txt"
## [4] "/tmp/Rtmpj1ArJu/file41367a1f7c07/covariates.txt"
## [5] "/tmp/Rtmpj1ArJu/file41367a1f7c07/centroids.txt"
## [6] "/tmp/Rtmpj1ArJu/file41367a1f7c07/coefficients.txt"
The asset can now be loaded into a classifier:
# Create a classifier.
clstest <- classifier(tpath)
##
## Simutypes
## configuration -> 162 bytes
## categories -> 917 bytes
## reference -> 295,772 bytes
## covariates -> 100 bytes
## centroids -> 5,333 bytes
## coefficients -> 2,453 bytes
# Classify samples.
simutest <- bcellALL(5)
covariates(clstest) <- simutest$metadata
profiles(clstest) <- simutest$counts
primtest <- predictions(clstest)[[1]]
print(primtest[,c("LABEL","MATCH","FREQ","PROX","EXCL")])
## LABEL MATCH FREQ PROX EXCL
## patient1 ETV6.RUNX1 ETV6.RUNX1 0.997 0.998 0.999
## patient2 TCF3.PBX1 TCF3.PBX1 0.690 0.958 0.637
## patient3 Ambiguous Hypodiploid 0.432 0.880 0.356
## patient4 Ph Ph 0.940 0.975 0.949
## patient5 KMT2A KMT2A 0.971 0.991 0.997
# Show correct subtypes.
print(simutest$metadata)
## MALE AGE SUBTYPE
## patient1 1 6 ETV6.RUNX1
## patient2 1 17 TCF3.PBX1
## patient3 1 50 Hypodiploid
## patient4 0 41 Ph
## patient5 0 49 KMT2A
Please use the tabs at the top to navigate to the next page.
Base R containts the classic Iris dataset that is attributed to Edgar Anderson who collected it and Ronald Fisher who used it as an example of linear discriminant analysis:
# Iris flower dataset.
print(head(iris))
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
To create a classification asset, it is good to make sure that the data are properly identified:
# Set row names.
flowers <- iris
rownames(flowers) <- paste0("flower", 1:nrow(flowers))
print(flowers[c(1,80,150),])
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## flower1 5.1 3.5 1.4 0.2 setosa
## flower80 5.7 2.6 3.5 1.0 versicolor
## flower150 5.9 3.0 5.1 1.8 virginica
The training dataset can now be finalized (note the format of the data matrix):
# Prepare training set.
materials <- list(title="Iris species")
materials$dat <- t(flowers[,1:4]) # vars on rows, samples on columns
materials$bits <- flowers[,"Species",drop=FALSE]
The original species labels are short, however, it is nice to have the full species names in the results as well. Adding labels can be accomplished by
# Set human-readable category labels.
model <- asset()
labels <- c("Iris Setosa", "Iris Virginica", "Iris Versicolor")
names(labels) <- c("setosa", "virginica", "versicolor")
visuals(model) <- labels
Next, the RNA-seq normalization and logarithm needs to be disabled since this is not RNA-seq data:
# Configure a new asset.
configuration(model) <- c(norm=FALSE, logarithm=FALSE)
configuration(model) <- c(nonzero.min=0, nonzero.ratio=0)
print(configuration(model))
## norm nonzero.min nonzero.ratio standard logarithm
## 0.00 0.00 0.00 1.00 0.00
## ninput.max rrinput.max proximity.min exclusivity.min
## 20.00 0.36 0.50 0.50
The asset can now be finished
# Assemble the classification asset.
assemble(model) <- materials
tpath <- tempfile()
export(model, folder = tpath)
## [1] "/tmp/Rtmpj1ArJu/file4136302d7d90/configuration.txt"
## [2] "/tmp/Rtmpj1ArJu/file4136302d7d90/categories.txt"
## [3] "/tmp/Rtmpj1ArJu/file4136302d7d90/reference.txt"
## [4] "/tmp/Rtmpj1ArJu/file4136302d7d90/centroids.txt"
## [5] "/tmp/Rtmpj1ArJu/file4136302d7d90/coefficients.txt"
and tested:
# Classify samples.
clsiris <- classifier(tpath)
##
## Iris species
## configuration -> 160 bytes
## categories -> 296 bytes
## reference -> 264 bytes
## centroids -> 298 bytes
## coefficients -> 313 bytes
profiles(clsiris) <- t(flowers[,1:4])
iristest <- predictions(clsiris)[[1]]
print(iristest[c(1,80,150),c("LABEL","MATCH","PROX","EXCL")])
## LABEL MATCH PROX EXCL
## flower1 Iris Setosa setosa 1.000 1.000
## flower80 Iris Versicolor versicolor 0.887 0.931
## flower150 Ambiguous versicolor 0.827 0.044
We know that the Setosa species is easiest to distinguish of the three (see e.g. the Wikipedia page on the Iris study) and this is manifested as an increase in ambiguous results for the two other categories:
# Summary of results.
print(table(iristest$LABEL, flowers$Species))
##
## setosa versicolor virginica
## Ambiguous 0 11 16
## Iris Setosa 49 0 0
## Iris Versicolor 0 33 1
## Iris Virginica 0 3 29
## Unclassified 1 3 4
Please use the tabs at the top to navigate to the next page.
On the previous tab, we changed the labels of the categories. It is also possible to set duplicated labels, indeed, the default classifier contains a tissue detection asset where the B-cell ALL covers multiple ALL subtype centroids. Let’s revisit the ALL classifier:
# Default ALL classifier.
cls <- classifier()
# Predict source tissue.
simu <- bcellALL(5)
covariates(cls) <- simu$metadata
profiles(cls) <- simu$counts
tissues <- predictions(cls)[[3]]
print(tissues[,c("LABEL","CATEG","MATCH","MATCH.2nd")])
## LABEL CATEG MATCH MATCH.2nd
## patient1 B-cell ALL ETV6.RUNX1 ETV6.RUNX1 GTEx.Fallopian
## patient2 B-cell ALL PAX5.p80r PAX5.p80r ETP.ALL
## patient3 B-cell ALL Hypodiploid Hypodiploid Lymphoblast
## patient4 B-cell ALL Hyperdiploid Hyperdiploid GTEx.Fallopian
## patient5 B-cell ALL CRLF2 CRLF2 ETP.ALL
Notice how the predicted categories are different ALL subtypes, but the label for every one of them is “B-cell ALL”. In the visual report, only the supercategory is displayed (see the figures in the section on sample classification). The output also shows the second best matching category, which in this case is the T-cell ALL RNA centroid. This is expected, since it is another similar form of leukemia.
Why would you use labels to define these supercategories? The answer is a technical one. You could put all the ALL samples together and calculate a single B-cell ALL centroid, however, this model would be at risk of being too coarse considering the great diversity of RNA profiles associated with specific subtypes. For example, the rare but distinctive subtype driven by the N159Y alteration in IKZF1 could be considered an outlier as it is far from the overall ALL centroid, and thus these samples could be incorrectly classified if the IKZF1 N159Y was not defined as an explicit subtype.
Classification assets are stored as text files on the disk and, although not recommended, it is possible to view and edit them directly. For instance, the files for ALL subtypes that were imported into the default classifier can be found via system files that are part of the R library:
# Show asset contents.
base <- system.file(package = "Allspice")
folder <- file.path(base, "subtypes")
print(dir(folder))
## [1] "categories.txt" "centroids.txt" "coefficients.txt"
## [4] "configuration.txt" "covariates.txt" "nomenclature.txt"
## [7] "reference.txt"
Categories file contains information on the subtypes and their visual attributes:
# Category information.
dat <- read.delim(file.path(folder, "categories.txt"))
print(dat)
## CATEG LABEL COLOR COLOR.light COLOR.dark
## 1 BCL2_MYC BCL2/MYC #FF47BDFF #FF67DEFF #E52092FF
## 2 CDX2.high CDX2 hi-exp #FF458CFF #FF65ABFF #E61F66FF
## 3 CRLF2 CRLF2 #FF435BFF #FF6278FF #E71D39FF
## 4 DUX4 DUX4 #FF4531FF #FF644BFF #E81F13FF
## 5 ETV6.RUNX1 ETV6::RUNX1 #FF5A28FF #FF7D3FFF #E62F0FFF
## 6 HLF HLF #FF701FFF #FF9633FF #E43F0AFF
## 7 Hyperdiploid Hyperdiploid #FA8816FF #FDAE27FF #DC5306FF
## 8 Hypodiploid Hypodiploid #ECA60DFF #F8C71AFF #C96F04FF
## 9 iAMP21 iAMP21 #DEC304FF #F2E00EFF #B58B01FF
## 10 IKZF1.n159y IKZF1 N159Y #BDD708FF #D7EF13FF #93A003FF
## 11 KMT2A KMT2A #8AE218FF #A7F529FF #62AE09FF
## 12 MEF2D MEF2D #57EC28FF #76FA3FFF #31BC0FFF
## 13 NUTM1 NUTM1 #38E455FF #53F56EFF #19B536FF
## 14 PAX5.alt PAX5 alt. #23D390FF #37EBA9FF #0EA26FFF
## 15 PAX5.p80r PAX5 P80R #0EC1CBFF #1BE1E5FF #0490A7FF
## 16 Ph Ph (BCR::ABL1) #11ACE1FF #20D0F3FF #057CC2FF
## 17 TCF3.PBX1 TCF3::PBX1 #1A97F0FF #2DBDF9FF #0867D6FF
## 18 ZNF384 ZNF384 #2382FFFF #39AAFFFF #0C52EBFF
## 19 Unclassified Unclassified #888888 #b0b0b0 #404040
## 20 Ambiguous Ambiguous #888888 #b0b0b0 #404040
Centroids file contains subtype profiles for pre-processed data values:
# Standardized subtype profiles.
dat <- read.delim(file.path(folder, "centroids.txt"))
print(dat[1:5,1:6])
## VAR BCL2_MYC CDX2.high CRLF2 DUX4 ETV6.RUNX1
## 1 ECM1 -1.910 -1.075 0.729 -1.645 -0.0575
## 2 BLACE -1.357 1.211 0.667 -1.230 0.9578
## 3 TLCD5 -0.366 0.771 -0.969 -0.149 1.0247
## 4 LAMP5 0.465 1.738 -0.573 -0.633 -0.3943
## 5 MSR1 -0.167 -0.423 0.647 -0.541 1.1322
cat(nrow(dat), " genes, ", ncol(dat), " subtypes\n", sep="")
## 45 genes, 19 subtypes
Coefficients file contains the parameters for the back-end regression analysis that combines the RNA biomarker from the centroid model with the covariates. B0 is the intercept and B1 is the coefficient for the RNA biomarker. Note that the coefficients depend on the scale of the inputs (covariates are never standardized):
# Regression coefficients.
dat <- read.delim(file.path(folder, "coefficients.txt"))
print(dat[1:5,])
## CATEG MODEL B0 B1 MALE AGE
## 1 BCL2_MYC prox 3.06 -3.60 0.77379 0.01618
## 2 CDX2.high prox 3.49 -3.27 -0.22648 0.01048
## 3 CRLF2 prox 5.26 -5.81 0.05526 -0.00232
## 4 DUX4 prox 9.80 -10.06 0.55972 0.01437
## 5 ETV6.RUNX1 prox 9.61 -10.21 0.00285 0.00106
Configuration file contains the settings to determine how data should be processed:
# Asset settings.
dat <- read.delim(file.path(folder, "configuration.txt"))
print(dat)
## PARAM VALUE
## 1 title ALL subtype
## 2 norm 1
## 3 nonzero.min 100
## 4 nonzero.ratio 0.01
## 5 standard 1
## 6 logarithm 1
## 7 ninput.max 45
## 8 rrinput.max 0.36
## 9 proximity.min 0.5
## 10 exclusivity.min 0.5
Covariates file contains the basic statistics of the covariates. The MEAN is used as replacements for missing values:
# Covariate statistics.
dat <- read.delim(file.path(folder, "covariates.txt"))
print(dat)
## VAR N MEAN SD
## 1 MALE 1598 0.566 0.496
## 2 AGE 1598 23.886 21.831
Nomenclature file contains a dictionary for converting gene names from one system to another. In this example, the asset uses Hugo symbols but can translate ENSEMBL gene names as well:
# Gene names.
dat <- read.delim(file.path(folder, "nomenclature.txt"))
print(dat[1:5,])
## VAR ENSEMBL NAME
## 1 NFYA ENSG00000001167 nuclear transcription factor Y subunit alpha
## 2 STPG1 ENSG00000001460 sperm tail PG-rich repeat containing 1
## 3 ENPP4 ENSG00000001561 ectonucleotide pyrophosphatase/phosphodiesterase 4
## 4 SEMA3F ENSG00000001617 semaphorin 3F
## 5 LAP3 ENSG00000002549 leucine aminopeptidase 3
Reference file contains basic statistics of the RNA-seq training data. Only genes that are usable under the asset settings are included. The VALUE column is the median of original values, whereas the mean and standard deviation are calculated from the logarithms if the asset is so configured:
# RNA reference profile.
dat <- read.delim(file.path(folder, "reference.txt"))
print(dat[1:5,])
## VAR VALUE MEAN SD
## 1 NFYA 3008.9 11.55 0.631
## 2 STPG1 85.4 6.45 1.226
## 3 ENPP4 129.6 7.04 2.060
## 4 SEMA3F 250.6 7.99 2.475
## 5 LAP3 1985.9 10.95 0.715
cat(nrow(dat), " genes\n", sep="")
## 5808 genes
## R version 4.2.2 Patched (2022-11-10 r83330)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.5 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
##
## locale:
## [1] LC_CTYPE=en_AU.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_AU.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_AU.UTF-8 LC_MESSAGES=en_AU.UTF-8
## [7] LC_PAPER=en_AU.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] Allspice_1.0.7
##
## loaded via a namespace (and not attached):
## [1] digest_0.6.30 R6_2.5.1 lifecycle_1.0.3 jsonlite_1.8.4
## [5] magrittr_2.0.3 evaluate_0.18 highr_0.9 stringi_1.7.8
## [9] rlang_1.0.6 cachem_1.0.6 cli_3.4.1 jquerylib_0.1.4
## [13] bslib_0.4.1 vctrs_0.5.1 rmarkdown_2.18 tools_4.2.2
## [17] stringr_1.5.0 glue_1.6.2 xfun_0.35 yaml_2.3.6
## [21] fastmap_1.1.0 compiler_4.2.2 htmltools_0.5.4 knitr_1.41
## [25] sass_0.4.4
## [1] "2023-01-20 13:40:56 ACDT"