Over the past few years, models that can predict the structure or function of proteins have been widely used for a variety of biological applications, such as identifying drug targets and designing new therapeutic antibodies.
These models, which are based on large language models (LLMs), can make very accurate predictions of a protein's suitability for a given application. However, there is no way to determine how these models make their predictions or which protein features play the most important role in those decisions.
In a new study, MIT researchers have used a novel technique to open up that “black box” and determine which features a protein language model takes into account when making predictions. Understanding what is happening inside that black box can help researchers choose better models for a particular task, helping to streamline the process of identifying new drugs or vaccine targets.
“Our work has broad implications for enhanced explainability in downstream tasks that rely on these representations,” says Bonnie Berger, the Simons Professor of Mathematics and head of the Computation and Biology group in MIT’s Computer Science and Artificial Intelligence Laboratory. “Additionally, identifying the features that protein language models track has the potential to reveal novel biological insights from these representations.”
MIT graduate student Onkar Gujral is the lead author of the open-access study, which appears this week in the Proceedings of the National Academy of Sciences. Mihir Bafna, an MIT graduate student in electrical engineering and computer science, and Eric Alm, an MIT professor of biological engineering, are also authors of the paper.
Opening the black box
In 2018, Berger and former MIT graduate student Tristan Bepler PhD ’20 introduced the first protein language model. Their model, like the protein language models that came later, was based on LLMs, and it helped pave the way for models such as ESM2 and OmegaFold, as well as AlphaFold. LLMs, which include ChatGPT, can analyze huge amounts of text and figure out which words are most likely to appear together.
Protein language models use a similar approach, but instead of analyzing words, they analyze amino acid sequences. Researchers have used these models to predict the structure and function of proteins, and for applications such as identifying proteins that might bind to particular drugs.
In a 2021 study, Berger and colleagues used a protein language model to predict which sections of viral surface proteins are less likely to mutate in a way that enables viral escape. This allowed them to identify possible targets for vaccines against influenza, HIV, and SARS-CoV-2.
However, in all of these studies, there was no way to know how the models were making their predictions.
“We would get some prediction at the end, but we had absolutely no idea what was happening in the individual components of this black box,” Berger says.
In the new study, the researchers wanted to dig into how protein language models make their predictions. Just as with LLMs, protein language models encode information as representations that consist of a pattern of activation of different “nodes” within a neural network. These nodes are analogous to the networks of neurons that store memories and other information within the brain.
The inner workings of LLMs are not easy to interpret, but within the past couple of years, researchers have begun using a type of algorithm known as a sparse autoencoder to help shed light on how those models make their predictions. The new study from Berger’s lab is the first to use this algorithm on protein language models.
Sparse autoencoders work by adjusting how a protein is represented within a neural network. Typically, a given protein will be represented by a pattern of activation of a constrained number of neurons, for example, 480. A sparse autoencoder expands that representation into a much larger number of nodes, say 20,000.
When information about a protein is encoded by only 480 neurons, each node lights up for multiple features, making it very difficult to know what each node is encoding. However, when the neural network is expanded to 20,000 nodes, the information has room to spread out, along with a sparsity constraint that encourages most nodes to stay inactive. Now, a protein feature that was previously encoded by multiple nodes can occupy a single node.
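To make the idea concrete, the sketch below shows what a sparse autoencoder of this kind might look like in PyTorch. It is only an illustration, not the authors' implementation: the 480-dimensional input and the 20,000-node expansion come from the example above, while the ReLU encoder, the L1 sparsity penalty, and its weight are assumptions chosen for simplicity.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Expand a dense protein-embedding vector into a much wider, sparse one."""
    def __init__(self, embed_dim=480, hidden_dim=20_000):
        super().__init__()
        self.encoder = nn.Linear(embed_dim, hidden_dim)   # 480 -> 20,000 nodes
        self.decoder = nn.Linear(hidden_dim, embed_dim)   # reconstruct the original 480

    def forward(self, x):
        z = torch.relu(self.encoder(x))   # non-negative activations; most should stay near zero
        x_hat = self.decoder(z)
        return x_hat, z

def loss_fn(x, x_hat, z, l1_weight=1e-3):
    # Reconstruction error keeps the wide representation faithful to the original embedding;
    # the L1 term is the "sparsity constraint" that pushes most of the 20,000 nodes toward zero.
    return nn.functional.mse_loss(x_hat, x) + l1_weight * z.abs().mean()

# Example: one batch of protein-language-model embeddings (random stand-ins here)
model = SparseAutoencoder()
embeddings = torch.randn(8, 480)
x_hat, z = model(embeddings)
loss = loss_fn(embeddings, x_hat, z)
loss.backward()
```

In a sketch like this, the sparse activations `z` are what would be inspected afterward: each of the 20,000 positions is a candidate "node" whose strongest-activating proteins can be examined.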
“In a sparse representation, the neurons that light up are doing so in a more meaningful manner,” Gujral says. “Before the sparse representations are created, the networks pack information so tightly together that it’s hard to interpret the neurons.”
Interpretable model
Once the researchers obtained sparse representations of many proteins, they used an AI assistant called Claude (related to the popular Anthropic chatbot of the same name) to analyze the representations. In this case, they asked Claude to compare the sparse representations with the known features of each protein, such as molecular function, protein family, or location within a cell.
By analyzing thousands of representations, Claude can determine which nodes correspond to specific protein features, then describe them in plain English. For example, the algorithm might say, “This neuron appears to be detecting proteins involved in transmembrane transport of ions or amino acids, particularly those located in the plasma membrane.”
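As a rough sketch of how such an automated labeling step could work, the snippet below builds a prompt from the annotations of a node's most strongly activating proteins and asks Claude for a one-sentence description. This is not the study's actual pipeline: the helper function, prompt wording, and model name are illustrative assumptions, and it uses Anthropic's standard Python SDK.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def label_node(node_id, top_protein_annotations):
    """Ask Claude to summarize what one sparse-autoencoder node might be detecting.

    `top_protein_annotations` is a list of annotation strings (molecular function,
    protein family, cellular location) for the proteins that activate this node most.
    """
    prompt = (
        f"Node {node_id} of a sparse autoencoder activates most strongly for proteins "
        "with the following annotations:\n"
        + "\n".join(f"- {a}" for a in top_protein_annotations)
        + "\n\nIn one plain-English sentence, what feature does this node appear to encode?"
    )
    response = client.messages.create(
        model="claude-3-5-sonnet-latest",  # assumed model name; any Claude model would do
        max_tokens=100,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text

# Hypothetical usage with made-up annotations:
# print(label_node(1042, [
#     "ion transmembrane transport; plasma membrane",
#     "amino acid transmembrane transporter activity; plasma membrane",
# ]))
```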
This process makes the nodes far more “interpretable,” meaning the researchers can tell what each node is encoding. They found that the features most likely to be encoded by these nodes were protein families and certain functions, including several different metabolic and biosynthetic processes.
“When you train a sparse autoencoder, you aren’t training it to be interpretable, but it turns out that by incentivizing the representation to be really sparse, that ends up resulting in interpretability,” Gujral says.
Understanding what features a particular protein model is encoding can help researchers choose the right model for a particular task, or tweak the type of input they give the model, to generate the best results. Additionally, analyzing the features that a model encodes could one day help biologists learn more about the proteins they are studying.
“At some point when the models get a lot more powerful, you could learn more biology than you already know, from opening up the models,” Gujral says.
The research was funded by the National Institutes of Health.