What is CellTypist?

CellTypist is an automated cell type annotation tool for scRNA-seq datasets, based on logistic regression classifiers optimised by stochastic gradient descent (SGD). With CellTypist, cell type labels can be transferred to the query data from the built-in models (currently focused on immune cell types) or from any user-trained model.

How are CellTypist models trained?

All models are built on the logistic regression framework. Traditional logistic regression is used in most cases; SGD learning can optionally be used instead, depending on the size of the training dataset. For example, when the training dataset contains a very large number of cells, the data can be modelled with SGD logistic regression using mini-batch training. Briefly, in each epoch the cells are shuffled and binned into equal-sized mini-batches (1,000 cells per batch), and the model is then trained sequentially on 100 such batches randomly sampled from all batches. This process is repeated for 10 to 30 epochs. Users can also generate their own models through either of these approaches.
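The mini-batch scheme described above can be sketched as a small batch scheduler. This is a minimal illustration, not CellTypist's own code: the function name `minibatch_schedule` and its parameters are hypothetical, and dropping an incomplete final batch is an assumption made here so that all batches are equal-sized.

```python
import random


def minibatch_schedule(n_cells, batch_size=1000, batches_per_epoch=100,
                       n_epochs=10, seed=0):
    """Yield (epoch, cell_indices) pairs following a shuffle/bin/sample scheme.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    for epoch in range(n_epochs):
        # Shuffle all cell indices at the start of every epoch.
        order = list(range(n_cells))
        rng.shuffle(order)
        # Bin the shuffled cells into equal-sized mini-batches.
        # Assumption: cells that do not fill a whole batch are dropped.
        n_full = n_cells // batch_size
        batches = [order[i * batch_size:(i + 1) * batch_size]
                   for i in range(n_full)]
        # Randomly sample batches (without replacement) for this epoch.
        chosen = rng.sample(batches, min(batches_per_epoch, len(batches)))
        for batch in chosen:
            # In real training, the expression rows for `batch` would be
            # passed to an incremental learner at this point.
            yield epoch, batch


# Example: 250,000 cells, two epochs, default batch settings.
schedule = list(minibatch_schedule(n_cells=250_000, n_epochs=2))
```

With these settings each epoch visits 100,000 of the 250,000 cells, so a given cell is seen in only some epochs; repeating the process over many epochs is what lets the model cover the whole dataset.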

Where can I find all available models?

Check the Model list for a listing of all models. Current models may be updated and new models may be added.

How are cell type labels assigned to query datasets?

For each gene shared between the model and the query data, CellTypist standardises its expression according to the parameters recorded in the model. The decision score of a query cell for a given cell type is defined as the linear combination of the scaled gene expression and the model coefficients associated with that cell type, and the cell type with the maximal decision score is selected as the predicted identity.
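The scoring step can be illustrated with a toy example. This is a sketch under stated assumptions: the dictionary layout of `toy_model` (keys `genes`, `mean`, `scale`, `coef`, `intercept`) is hypothetical and does not reflect how CellTypist stores its models, the numbers are made up, and the per-cell-type intercept is the usual logistic regression bias term, included here as an assumption.

```python
def decision_scores(cell, model):
    """Return a {cell_type: score} dict for one query cell.

    `cell` maps gene names to expression values; `model` uses a
    hypothetical layout chosen for this illustration.
    """
    # Only genes present in both the model and the query cell are used.
    shared = [g for g in model["genes"] if g in cell]
    # Standardise each shared gene with the parameters stored in the model.
    scaled = {g: (cell[g] - model["mean"][g]) / model["scale"][g]
              for g in shared}
    scores = {}
    for cell_type, coefs in model["coef"].items():
        # Linear combination of scaled expression and coefficients,
        # plus an intercept (assumption, see lead-in).
        scores[cell_type] = model["intercept"][cell_type] + sum(
            coefs[g] * scaled[g] for g in shared)
    return scores


def predict(cell, model):
    """Pick the cell type with the maximal decision score."""
    scores = decision_scores(cell, model)
    return max(scores, key=scores.get)


# Toy model with made-up numbers.
toy_model = {
    "genes": ["CD3E", "CD19", "NKG7"],
    "mean": {"CD3E": 1.0, "CD19": 1.0, "NKG7": 1.0},
    "scale": {"CD3E": 1.0, "CD19": 1.0, "NKG7": 1.0},
    "coef": {
        "T cell": {"CD3E": 1.0, "CD19": -1.0, "NKG7": 0.0},
        "B cell": {"CD3E": -1.0, "CD19": 1.0, "NKG7": 0.0},
    },
    "intercept": {"T cell": 0.0, "B cell": 0.0},
}

# NKG7 is missing from the query and GAPDH is not in the model,
# so only CD3E and CD19 contribute to the scores.
query_cell = {"CD3E": 3.0, "CD19": 0.0, "GAPDH": 5.0}
scores = decision_scores(query_cell, toy_model)
label = predict(query_cell, toy_model)
```

Here the scaled values are 2.0 for CD3E and -1.0 for CD19, giving a score of 3.0 for "T cell" and -3.0 for "B cell", so the cell is labelled "T cell".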

How does majority voting add to the prediction?

Prediction results are refined by a majority voting approach, based on the idea that transcriptionally similar cells are more likely to form a (sub)cluster regardless of their individual prediction outcomes. The query data is over-clustered (by Leiden clustering with a canonical Scanpy pipeline) and each resulting subcluster is assigned the identity of the dominant predicted cell type among its cells. In this way, distinguishable small subclusters are assigned distinct labels, while homogeneous subclusters are assigned the same label and iteratively converge to a bigger cluster.
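The voting step itself can be sketched in a few lines. This is a simplified illustration rather than CellTypist's implementation: the function name `majority_vote` is hypothetical, the subcluster assignments are taken as given (the over-clustering step is not shown), and ties are broken in favour of the label seen first within a subcluster, which is an assumption of this sketch.

```python
from collections import Counter


def majority_vote(predicted_labels, subclusters):
    """Relabel every cell with the dominant predicted label of its subcluster.

    Hypothetical helper; both arguments are per-cell sequences of equal length.
    """
    # Count the per-cell predictions inside each subcluster.
    votes = {}
    for label, cluster in zip(predicted_labels, subclusters):
        votes.setdefault(cluster, Counter())[label] += 1
    # The most frequent label wins; ties go to the label seen first
    # (assumption of this sketch).
    winner = {cluster: counts.most_common(1)[0][0]
              for cluster, counts in votes.items()}
    return [winner[cluster] for cluster in subclusters]


# Six cells in two subclusters, each with one outlying prediction.
predicted = ["T cell", "T cell", "B cell", "B cell", "B cell", "T cell"]
clusters = [0, 0, 0, 1, 1, 1]
refined = majority_vote(predicted, clusters)
```

In this example the single "B cell" call in subcluster 0 and the single "T cell" call in subcluster 1 are overruled, so the refined labels are uniform within each subcluster.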

How can I contribute to the cell types/annotations?

You can send a pull request after modifying the table containing basic cell type information (Basic_celltype_information.xlsx). You can also send your suggested changes for a given cell type within the encyclopedia to celltypist@sanger.ac.uk.

Can you include my model in CellTypist?

We appreciate any models generated by users. Please follow the training procedure in CellTypist and send your model to celltypist@sanger.ac.uk.