Natural Language Processing (NLP) algorithms have long played a tremendous role in analyzing text sentiment and extracting meaning, spurring the development of applications like chatbots and virtual assistants. Now, however, these same algorithms have been equipped with a surprising yet welcome ability: generating protein sequences and predicting virus mutations.
In a study published in Science today, computational biologist Bonnie Berger and her colleagues demonstrated how NLP can be used to predict mutations that allow viruses to avoid detection by antibodies in the human immune system, a process that is aptly known as viral immune escape.
As with many developments in the world of science and technology, the fundamental insight behind this one is pretty simple. As it happens, many properties of biological systems can be understood in terms of words and sentences. In this case, interpreting the protein sequence of a virus is very much like interpreting the sequence of words and characters in a sentence.
Berger’s team used two different linguistic concepts: grammar and semantics (meaning). The genetic or evolutionary fitness of a virus (characteristics such as how good it is at infecting a host) can be interpreted in terms of grammatical correctness. A successful, infectious virus is “grammatically correct”, while an unsuccessful one is not.
Similarly, mutations of a virus can be interpreted in terms of semantics. Mutations that make a virus appear different to things in its environment, such as changes in its surface proteins that make it invisible to certain antibodies, have altered its meaning. Viruses with different mutations can have different meanings, and a virus whose meaning has changed may need different antibodies to "read" it.
In order to model these properties, the team used a long short-term memory (LSTM) neural network trained on thousands of genetic sequences taken from three different viruses: 45,000 unique sequences from influenza, 60,000 from HIV, and between 3,000 and 4,000 from a strain of coronavirus.
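The core idea of scoring a sequence's "grammaticality" is that a language model assigns high probability to sequences resembling the ones it was trained on. The team used an LSTM for this; the toy sketch below uses a far simpler bigram model over amino-acid letters (with made-up training fragments, not real viral data) purely to illustrate how sequence likelihood can stand in for grammatical fitness.

```python
from collections import defaultdict
import math

def train_bigram(sequences):
    """Count amino-acid bigrams across the training sequences."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1
    return counts

def log_likelihood(counts, seq, alphabet="ACDEFGHIKLMNPQRSTVWY"):
    """Average log-probability of a sequence -- a crude 'grammaticality' score.
    Uses add-one smoothing so unseen bigrams get small but nonzero probability."""
    score = 0.0
    for a, b in zip(seq, seq[1:]):
        total = sum(counts[a].values()) + len(alphabet)
        score += math.log((counts[a][b] + 1) / total)
    return score / max(len(seq) - 1, 1)

# Hypothetical training fragments (illustrative only)
train = ["MKTAYIAK", "MKTAYLAK", "MKSAYIAK"]
model = train_bigram(train)

# A sequence resembling the training data scores higher than a garbled one
print(log_likelihood(model, "MKTAYIAK") > log_likelihood(model, "MQQQQQQK"))  # True
```

A real model like the team's LSTM does the same thing at vastly greater capacity, conditioning each residue on the full preceding context rather than just the previous letter.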
Why less data for the coronavirus strain? According to Brian Hie, an MIT student involved in building the models, this is simply because there has been less surveillance of the virus responsible for the COVID-19 pandemic.
The team's ultimate aim was to identify mutations that might let a virus escape the immune system without making it less infectious. In NLP terms, they were looking for mutations that change the virus's meaning without making it grammatically incorrect.
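One simple way to combine the two criteria is to rank candidate mutations on each axis separately and prioritize those that score high on both. The sketch below does exactly that with hypothetical scores; it is an illustration of the idea, not the team's actual scoring procedure.

```python
def rank(values):
    """Assign each position a rank (0 = lowest value, n-1 = highest)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for r, i in enumerate(order):
        ranks[i] = r
    return ranks

def escape_priority(grammaticality, semantic_change):
    """Sum the per-axis ranks: mutations high on BOTH axes rise to the top."""
    g, s = rank(grammaticality), rank(semantic_change)
    return [gi + si for gi, si in zip(g, s)]

# Hypothetical scores for four candidate mutations
gram = [0.9, 0.8, 0.2, 0.7]   # how "grammatical" (fit) each mutant remains
sem  = [0.1, 0.8, 0.9, 0.6]   # how much each mutant's "meaning" shifts
prio = escape_priority(gram, sem)

best = prio.index(max(prio))
print(best)  # 1 -- still fit, yet semantically very different
```

Mutation 2 changes meaning the most but wrecks fitness, and mutation 0 stays fit but barely changes meaning; neither is a good escape candidate, which is why the combined ranking picks mutation 1.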
To test their approach, the team used a common metric for assessing predictions made by machine-learning models, which scores accuracy on a scale from 0.5 (no better than chance) to 1 (perfect). They took the top mutations identified by the tool and, using real viruses in a lab, checked how many of them were actual escape mutations. Their results ranged from 0.69 for HIV to 0.85 for one coronavirus strain. This is better than results from other state-of-the-art models, they say.
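The article does not name the metric, but a 0.5-to-1 scale running from chance to perfection describes the area under the ROC curve (AUC): the probability that a randomly chosen true escape mutation is scored above a randomly chosen non-escape one. A minimal sketch, with hypothetical scores and labels:

```python
def auc(scores, labels):
    """Area under the ROC curve, computed as the fraction of
    (positive, negative) pairs the model orders correctly
    (ties count as half a win)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical model scores and lab-verified escape labels (1 = escaped)
scores = [0.9, 0.8, 0.4, 0.3, 0.2]
labels = [1,   0,   1,   0,   0]
print(auc(scores, labels))  # 5/6 of the pairs ranked correctly, ~0.83
```

A score of 0.5 means the model's ranking is no better than shuffling, which is why that end of the scale corresponds to chance.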
The utility of NLP algorithms that identify coronavirus mutations is that hospitals and public-health institutes can use the knowledge to plan proactively for the future. For instance, the algorithm can report how much a flu strain's meaning has changed over a given period of time, which can help an expert judge how well the antibodies developed by patients' immune systems are likely to perform.
The team has so far been busy running their models on all kinds of coronavirus variants, including the notorious British variant, the mink mutation from Denmark, and other variants from South Africa, Singapore, and Malaysia.