Cross-Species Cell Embeddings AI enhances cell type discovery, identifies previously elusive “Norn cells”

Published

Mar 27, 2024

Reading time

2 min read

Researchers used an AI system to identify animal cell types from gene sequences, including a cell type that conventional approaches had discovered only in the past year.

What’s new: Biologists at Stanford trained a system to produce embeddings that represent individual cells in an organism. This enabled them to find cell types that have common function in different animals; for instance, the Norn cell, a type of kidney cell that biologists had previously theorized but discovered only in 2023.

How it works: Universal Cell Embedding (UCE) comprises two transformers that produce embeddings of genes and cells respectively, plus a classifier based on a vanilla neural network. The authors trained the classifier, given embeddings of a gene and cell, to classify whether or not the cell produces the protein coded by that gene. The training dataset included RNA sequences of 36.2 million cells from eight animal species (humans and mice accounted for 33.9 million) along with related protein structures.

The authors represented each cell as a sequence of gene embeddings, laid out in the order in which they appear in the cell’s genome. Instead of including all of a cell’s genes, the authors sampled 1,024 genes known to encode proteins. A pretrained ESM-2 transformer computed each gene’s embedding based on the protein(s) — that is, amino acid sequence(s) — it produces.
The authors randomly masked 20 percent of the gene embeddings. Given the masked sequence, a vanilla transformer learned to compute an embedding of the cell.
For each gene in the cell, the authors concatenated its embedding with the cell embedding. Given the combined embeddings, the vanilla neural network learned to classify whether the genes encoded a protein.

Results: Cell embeddings produced by UCE enabled the authors to identify cell types in animal species that weren’t in the training set. For instance, the authors embedded a dataset of mouse cells and applied UMAP clustering to differentiate the types. They labeled the clusters as specific cell types (including Norn cells, which biologists took more than a century to find) based on the presence of certain genes that distinguish one cell type from another. Using the labels, they trained a logistic classifier. They applied the classifier to their training dataset and found Norn cells, among other cell types, in species other than mice. They verified the findings by looking for genes that tend to show up only in Norn cells.

Why it matters: UCE’s embeddings encode biologically meaningful information about individual cells, enabling a clustering algorithm to group them into recognized cell types. The fact that the recently discovered Norn cell was among those clusters suggests that UCE may yield further discoveries that accelerate development of new medicines, lab processes, and research methods. In fact, the model found Norn cells — which are known to occur in the kidney — in organs where they have not been seen before. If this result turns out to be valid, UCE will have made a discovery that has eluded biologists to date.

We’re thinking: It’s a truism that a machine learning model is only as good as its data. That makes this work all the more impressive: Its training data included a handful of species, yet it generalized to others.

Subscribe to The Batch