The Language of Viruses Researchers trained a neural net to predict viruses in DNA.

Published

Jan 27, 2021

Reading time

2 min read

A neural network learned to read the genes of viruses as though they were text. That could enable researchers to page ahead for potentially dangerous mutations.

What’s new: Researchers at MIT trained a language model to predict mutations that would enable infectious viruses — including the SARS-CoV-2 virus that causes Covid-19 — to become even more virulent.

Key insight: The authors suggest that the immune system’s response to viruses is similar to the way people understand natural language. A virus that causes infection has a “grammar” that’s biologically correct, and it also has a semantic “meaning” to which the immune system does or doesn’t respond. Mutations can enhance these worrisome qualities.

How it works: The authors trained a bidirectional LSTM on the genetic equivalent of making a language model guess a missing word in a sentence. The training set included gene sequences from a variety of infectious bugs: 45,000 variants of influenza, 60,000 of HIV, and 4,000 of SARS-CoV-2.

The researchers trained the biLSTM to fill in a missing amino acid in a sequence. Along the way, the model generated embeddings that represent relationships among sequences.
Then they generated mutated sequences by changing one amino acid at a time.
To rank a given mutation, they took a weighted sum of the likelihood that the mutated virus retained an infectious grammar and the degree of semantic difference between the original and mutated sequence’s embeddings.

Results: The researchers compared their model’s highest-ranked mutations to those of actual viruses according to the area under curve (AUC), where 0.5 is random and 1.0 is perfect. The model achieved 0.85 AUC in predicting SARS-CoV-2 variants that were highly infectious and capable of evading antibodies. It achieved 0.69 AUC for HIV, and 0.77 AUC and 0.83 AUC respectively for two strains of influenza.

Behind the news: Other researchers have also explored similarities between language and gene sequences. For example, Salesforce researchers trained a language model to treat amino acids like words and build grammatically correct “sentences” of functional proteins that could be used in medicine.

Why it matters: Discovering dangerous viral mutations typically takes weeks, as scientists must analyze DNA taken from patients. The ability to predict harmful mutations could help them find dangerous variants sooner, helping epidemiologists update their models and giving researchers a head start on vaccines and therapies.

We’re thinking: The Batch is grammatically correct but not infectious. Though we wouldn’t mind if it went viral!

Subscribe to The Batch