scAnnotate
is a supervised machine learning model for
cell-type annotation.
For more details, see our paper: [scAnnotate: an automated cell type annotation tool for single-cell RNA-sequencing data].
For this tutorial, we’ll work with two subsets of the human
Peripheral Blood Mononuclear Cells (PBMC) scRNA-seq dataset from the
SeuratData
package that were sequenced using two different
platforms.
First, we’ll load the scAnnotate
package.
We assume that you have log-transformed (i.e. size-factor normalized) matrices for both the training and testing data, where each row is a cell and each column is a gene.
The example datasets are already log-transformed and normalized. You can find more details about the example datasets by typing the following commands into the R console:
scAnnotate
has two separate workflows with different
batch effect removal steps based on the size of the training data. We
suggest using Seurat for dataset with at most one rare cell population
(at most one cell population less than 100 cells) and using Harmony for
dataset with at least two rare cell populations (at least two cell
populations less than 100 cells).
Our example datasets are already log-transformed and normalized.
Suppose your input gene expression data is in raw counts. In that
case, our software will normalize the raw input data using the
NormalizeData function from the Seurat
package, via the
“LogNormalize” method and a scale factor of 10,000. That normalizes the
gene expression by each cell’s “sequencing depth” and applies a natural
logarithmic transformation. When you input raw data, you should choose
the parameter lognormalized=FALSE using the scAnnotate
functions.
train
A data frame with cell type labels as the first
column, followed by a gene expression matrix where each row is a cell
and each column is a gene from the training dataset.test
A gene expression matrix where each row is a cell
and each column is a gene from the testing dataset.distribution
A character string indicating the
distribution assumption for positive gene expression levels. It should
be one of “normal” (default) or “dep”. “dep” refers to depth measure,
which is a non-parametric distribution estimation approach.correction
A character string indicating the batch
effect removal method of choice. It should be one of “auto” (default),
“seurat”, or “harmony”. “auto” will automatically select the batch
effect removal method that corresponds to scAnnotate’s recommended
workflow for the given situation. We use Seurat for dataset with at most
one rare cell population (at most one cell population less than 100
cells) and Harmony for dataset with at least two rare cell populations
(at least two cell populations less than 100 cells).screening
A character string indicating the gene
screening method of choice. It should be one of “wilcox”(default) or
“t.test”threshold
A numeric value indicates the threshold used
for probabilities to classify cells. It should be number from “0”
(default) to “1”. If there is no probability associated with any cell
type higher than the threshold, the given cell will be labeled as
“unassigned”.lognormalized
A logical string indicates whether both
input data are log-normalized or raw matrices. TRUE (default) indicates
input data are log-normalized, and FALSE indicates input data are raw
data.A vector containing the annotated cell type labels for the cells in the test data.