Every gene in the human genome exists in a variety of versions produced by mutations, and some of those versions encode protein variants that cause cells to malfunction, resulting in illness. Yet which mutations are associated with disease is largely unknown. Can deep learning identify them?
What’s new: Jonathan Frazer, Pascal Notin, Mafalda Dias, and colleagues at Harvard Medical School and the University of Oxford introduced Evolutionary Model of Variant Effect (EVE), a neural network that learned to classify disease-causing protein variants, and thus dangerous mutations, without labeled data.
Key insight: Mutations that encode disease-causing proteins tend to be rare because individuals who carry them are less likely to survive to reproductive age. Thus the prevalence of a given mutation indicates its potential role in illness. Among a collection of variants of a particular protein (a protein family), each variant is produced by a distinct mutation of a particular gene. Clustering uncommon and common variants within the family can separate the mutations likely to be associated with disease from those likely to be benign.
How it works: A variational autoencoder (VAE) learns to reproduce an input sequence by maximizing the likelihood that output tokens match the corresponding input tokens. In this case, each sequence is a chain of amino acids that makes up a protein in a database of 250 million proteins. The authors trained a separate VAE for each protein family. Given a variant in its family, each VAE learned to compute the likelihood of each amino acid in the sequence, which enabled the authors to derive the likelihood of the entire sequence (a minimal sketch follows).
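The Python sketch below illustrates the likelihood computation just described; it is not EVE’s architecture. It scores a one-hot encoded amino-acid sequence with a plain VAE’s evidence lower bound (ELBO), a tractable stand-in for the sequence’s log-likelihood. All sizes and names (SequenceVAE, sequence_elbo, SEQ_LEN, N_AA, LATENT) are illustrative assumptions.

```python
# Minimal VAE for scoring amino-acid sequences (illustrative sketch only).
import torch
import torch.nn as nn
import torch.nn.functional as F

SEQ_LEN, N_AA, LATENT = 100, 20, 32   # assumed sizes, not taken from the paper

class SequenceVAE(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Flatten(), nn.Linear(SEQ_LEN * N_AA, 256), nn.ReLU())
        self.to_mu = nn.Linear(256, LATENT)
        self.to_logvar = nn.Linear(256, LATENT)
        self.decoder = nn.Sequential(nn.Linear(LATENT, 256), nn.ReLU(),
                                     nn.Linear(256, SEQ_LEN * N_AA))

    def forward(self, x):                      # x: (batch, SEQ_LEN, N_AA), one-hot floats
        h = self.encoder(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)   # reparameterization trick
        logits = self.decoder(z).view(-1, SEQ_LEN, N_AA)          # per-position amino-acid logits
        return logits, mu, logvar

def sequence_elbo(model, x):
    """Evidence lower bound: a tractable approximation to log p(sequence)."""
    logits, mu, logvar = model(x)
    # Reconstruction term: summed log-probability of the observed amino acids.
    recon = -F.cross_entropy(logits.permute(0, 2, 1), x.argmax(-1), reduction="none").sum(-1)
    # KL term between the approximate posterior and a standard normal prior.
    kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1)
    return recon - kl   # higher = the sequence looks more typical of its family
```

With a model like this trained on one protein family, a variant’s rareness score could be taken as the drop in approximate log-likelihood relative to the family’s most common sequence, for example sequence_elbo(model, wild_type) - sequence_elbo(model, variant).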
- Within each protein family, the authors computed the likelihood of each variant and assigned it a rareness score based on the difference in likelihood between that variant and the most common version.
- The authors fitted a Gaussian mixture model, which learns a set of Gaussian distributions that assign data points to clusters, to the rareness scores of all variants in a family. Using two components yielded two clusters: one for rare variants and one for common ones (see the sketch after this list).
- They classified variants in the common cluster as benign and variants in the rare cluster as disease-causing. The 25 percent of variants whose cluster assignment was most ambiguous were classified as uncertain.
- Having classified a protein variant, they applied the same classification to the gene mutation that encoded it.
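Here is a hedged sketch of the clustering and classification steps above. It assumes each variant already has a scalar rareness score and fits a two-component Gaussian mixture with scikit-learn; only the two-cluster idea and the 25 percent uncertainty band come from the description above, while function and variable names are illustrative.

```python
# Fit a two-component Gaussian mixture to per-variant rareness scores, then
# mark the most ambiguous 25 percent of variants as uncertain. Names and the
# example scores are illustrative assumptions.
import numpy as np
from sklearn.mixture import GaussianMixture

def classify_variants(rareness_scores, uncertain_fraction=0.25):
    scores = np.asarray(rareness_scores, dtype=float).reshape(-1, 1)

    # Two clusters: one for common (benign-like) and one for rare (disease-like) variants.
    gmm = GaussianMixture(n_components=2, random_state=0).fit(scores)
    probs = gmm.predict_proba(scores)

    # Assume higher rareness scores indicate likely disease-causing variants,
    # so the component with the larger mean is the "pathogenic" cluster.
    pathogenic = int(np.argmax(gmm.means_.ravel()))
    p_pathogenic = probs[:, pathogenic]

    # Variants whose cluster membership is closest to 50/50 are the most
    # ambiguous; label the most ambiguous quarter as uncertain.
    ambiguity = -np.abs(p_pathogenic - 0.5)
    cutoff = np.quantile(ambiguity, 1.0 - uncertain_fraction)
    labels = np.where(p_pathogenic > 0.5, "pathogenic", "benign")
    labels[ambiguity >= cutoff] = "uncertain"
    return labels, p_pathogenic

# Toy usage: small scores cluster as benign, large scores as pathogenic.
labels, p = classify_variants([0.1, 0.2, 0.15, 0.3, 2.8, 3.1, 1.4, 2.2])
```

The quantile rule here is just one simple way to realize "the most ambiguous 25 percent"; a cutoff could also be defined directly on the mixture’s posterior probabilities.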
Results: The authors compared EVE’s classifications to those of 23 supervised and unsupervised models built to perform the same task. They checked the models’ classifications for 3,219 genes for which labels are known. EVE achieved 0.92 average AUC (area under the curve; higher is better), while the other methods achieved between 0.7 and 0.9. The authors also compared EVE’s output with lab tests that measure, for example, how cells that contain mutations respond to certain chemicals. EVE scored as well as or better than those tests on the five gene families whose labels are known with the highest confidence. For example, for the gene known as TP53, EVE achieved 0.99 AUC while the lab test achieved 0.95 AUC.
Why it matters: Unsupervised clustering can substitute for labels when we have a well-founded belief about what caused the clusters to emerge; in this case, that natural selection reduces the prevalence of disease-causing protein variants. This approach may open the door to analyzing other large datasets for which labels are unavailable.
We're thinking: Clustering unlabeled data and examining the clusters for insights is a tried-and-true technique. By employing VAEs to assess likelihoods, this work extends basic clustering to a wider array of problems.