Models like AlphaFold have made great strides in predicting the shapes of proteins, which determine their biological functions. New work separated proteins into functional families without considering their shapes.
What’s new: A team led by Maxwell L. Bileschi classified protein families using a model (called ProtCNN) and a process (called ProtREP) that used that model’s representations to address families that included fewer than 10 annotated examples. The project was a collaboration between Google, BigHat Biosciences, Cambridge University, European Molecular Biology Laboratory, Francis Crick Institute, and MIT.
Key insight: A neural network that has been trained on an existing database of proteins and their families can learn to assign a protein to a family directly. However, some families offer too few labeled examples to learn from. In such cases, an average representation of a given family’s members can provide a standard of comparison to determine whether other proteins fall into that family.
How it works: The authors trained a ResNet on a database of nearly 137 million proteins and nearly 18,000 family classifications.
- The authors trained the model to classify proteins in roughly 13,000 families that each contained 10 or more examples.
- Taking representations from the second-to-last layer, they averaged the representations of proteins in each family.
- At inference, they compared an input protein’s representation with each family’s average representation. They chose the family whose average matched most closely according to cosine similarity.
- In addition, they built an ensemble of 19 trained ResNets that determined classifications by majority vote.
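The comparison and voting steps above can be illustrated with a minimal sketch. It is not the authors' code. It assumes embeddings have already been extracted as plain lists of floats, and the function names, toy vectors, and family labels are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def family_centroids(embeddings, labels):
    """Average the embedding vectors of each family's labeled members."""
    grouped = defaultdict(list)
    for vector, family in zip(embeddings, labels):
        grouped[family].append(vector)
    return {
        family: [sum(column) / len(vectors) for column in zip(*vectors)]
        for family, vectors in grouped.items()
    }


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_family(embedding, centroids):
    """Pick the family whose average embedding is most similar to the input."""
    return max(
        centroids,
        key=lambda family: cosine_similarity(embedding, centroids[family]),
    )


def majority_vote(predictions):
    """Return the label predicted most often by the ensemble members.

    Ties go to the label that appeared first (an arbitrary choice for this sketch).
    """
    return Counter(predictions).most_common(1)[0][0]


# Toy 2-dimensional embeddings standing in for second-to-last-layer outputs.
train_embeddings = [[1.0, 0.1], [0.9, 0.0], [0.0, 1.0], [0.1, 0.8]]
train_labels = ["family_A", "family_A", "family_B", "family_B"]

centroids = family_centroids(train_embeddings, train_labels)
predicted = nearest_family([0.8, 0.2], centroids)

# Hypothetical outputs from three ensemble members for a single protein.
voted = majority_vote(["family_A", "family_B", "family_A"])
```

Because a family's average can be computed from only a handful of members, this kind of comparison remains usable for families with too few examples to train a classifier on directly.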
Results: The ensemble achieved 99.8 percent accuracy, higher than both the representation-comparison method (99.2 percent) and the popular method known as BLASTp (98.3 percent). When classifying members of low-resource families, the representation-comparison method achieved 85.1 percent accuracy. Applying the ensemble to unlabeled proteins increased the number of labeled proteins in the database by nearly 10 percent, more than the number of annotations added to the database over the past decade.
Why it matters: New problems don’t always require new methods. Many unsolved problems, in biology and beyond, may yield to well-established machine learning approaches such as few-shot learning techniques.
We’re thinking: Young people, especially, ought to appreciate this work. After all, it’s pro-teen.