
A protein located in the wrong part of a cell can contribute to many diseases, including Alzheimer's, cystic fibrosis, and cancer. But a single human cell contains about 70,000 different proteins and protein variants, and since scientists can typically test for only a handful in one experiment, manually identifying proteins' locations is extremely costly and time-consuming.
A new generation of computational techniques seeks to streamline the process using machine-learning models, which are often trained on datasets containing thousands of proteins and their locations, measured across many cell lines. One of the largest such datasets is the Human Protein Atlas, which catalogs the subcellular behavior of more than 13,000 proteins across more than 40 cell lines. But as enormous as it is, the Human Protein Atlas has explored only about 0.25 percent of all possible pairings of proteins and cell lines within its database.
Now, researchers from MIT, Harvard University, and the Broad Institute of MIT and Harvard have developed a new computational approach that can efficiently explore the remaining uncharted space. Their method can predict the location of any protein in any human cell line, even when both the protein and the cell have never been tested before.
Their technique goes a step beyond many AI-based methods by localizing a protein at the single-cell level, rather than producing an averaged estimate across all cells of a given type. For example, this single-cell localization could pinpoint a protein's location in a particular cancer cell after treatment.
The researchers combined a protein language model with a special type of computer vision model to capture rich details about a protein and a cell. In the end, the user receives an image of a cell with a highlighted portion indicating the model's prediction of where the protein is located. Because a protein's localization is indicative of its functional status, this technique could help researchers and clinicians diagnose diseases more efficiently or identify drug targets, while enabling biologists to better understand how complex biological processes relate to protein localization.
“You could do these protein-localization experiments on a computer without touching any lab bench, hopefully saving yourself months of effort. While you would still need to verify the prediction, this technique could act as an initial screening of what to test experimentally,” says graduate student Yitong Tseo of MIT’s Computational and Systems Biology program.
Tseo is joined on the paper by co-lead author Xinyi Zhang, a graduate student in the Department of Electrical Engineering and Computer Science (EECS) and the Eric and Wendy Schmidt Center at the Broad Institute; Yunhao Bai of the Broad Institute; and senior authors Fei Chen, an assistant professor at Harvard and a member of the Broad Institute, and Caroline Uhler, the Andrew and Erna Viterbi Professor of Engineering in EECS and the MIT Institute for Data, Systems, and Society (IDSS), who is also director of the Eric and Wendy Schmidt Center and a researcher at MIT’s Laboratory for Information and Decision Systems (LIDS). The research appears today in Nature Methods.
Collaborating models
Many existing protein-prediction models can only make predictions for the proteins and cell data on which they were trained, or they cannot pinpoint a protein's location within a single cell.
To overcome these limitations, the researchers developed a two-part method for predicting the subcellular location of unseen proteins, called PUPS.
The first part uses a protein sequence model to capture the localization-determining properties of a protein and its 3D structure, based on the chain of amino acids that forms it.
The second part incorporates an image inpainting model, which is designed to fill in missing parts of an image. This computer vision model looks at three stained images of a cell to gather information about the state of that cell, such as its type, individual features, and whether it is under stress.
PUPS joins the representations created by the two models to predict where the protein is located within a single cell, using an image decoder to output a highlighted image showing the predicted location.
“Different cells within a cell line exhibit different characteristics, and our model is able to understand that nuance,” Tseo says.
A user inputs the sequence of amino acids that forms the protein and three cell stain images: one for the nucleus, one for the microtubules, and one for the endoplasmic reticulum. Then PUPS does the rest.
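The article does not include the authors' code, but the overall pipeline it describes, a sequence embedding and a cell-image embedding fused by a decoder into a highlighted map, can be sketched in miniature. Every function and variable name below is a hypothetical stand-in, with trivial arithmetic in place of the real learned models:

```python
# Illustrative sketch of a PUPS-style inference pipeline, NOT the authors' code.
# Real protein language models and inpainting models produce learned vectors;
# here each "embedding" is a single toy number so the sketch runs anywhere.

def embed_sequence(sequence):
    """Toy stand-in for a protein language model: average the amino-acid
    character codes (a real model yields a learned feature vector)."""
    return [sum(ord(aa) for aa in sequence) / len(sequence)]

def embed_cell(nucleus, microtubules, er):
    """Toy stand-in for the image model's encoder: summarize the three
    stain channels (a real encoder captures cell type and state)."""
    pixels = [p for channel in (nucleus, microtubules, er)
              for row in channel for p in row]
    return [sum(pixels) / len(pixels)]

def decode_highlight(seq_emb, cell_emb, height, width):
    """Toy decoder: combine both embeddings into a per-pixel score map
    (a real decoder outputs an image highlighting the protein's location)."""
    score = (seq_emb[0] * cell_emb[0]) % 1.0
    return [[score for _ in range(width)] for _ in range(height)]

def predict_localization(sequence, nucleus, microtubules, er):
    seq_emb = embed_sequence(sequence)
    cell_emb = embed_cell(nucleus, microtubules, er)
    h, w = len(nucleus), len(nucleus[0])
    return decode_highlight(seq_emb, cell_emb, h, w)

# Usage: a 2x2 "cell" with three stain channels and a short peptide.
stain = [[0.1, 0.9], [0.4, 0.2]]
heatmap = predict_localization("MKTAYIAK", stain, stain, stain)
```

The point of the structure, not the arithmetic, is what matters here: two independent encoders whose outputs are fused before decoding is what lets the method accept a never-before-seen protein paired with a never-before-seen cell.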
A deep understanding
The researchers employed a few tricks during the training process to teach PUPS how to combine information from each model so that it can make an educated guess at a protein's location, even if it has never seen that protein before.
For instance, they assign the model a secondary task during training: to explicitly name the compartment of localization, like the cell nucleus. This is done alongside the primary inpainting task to help the model learn more effectively.
A good analogy might be a teacher who asks their students to draw all the parts of a flower in addition to writing their names. This extra step was found to help the model improve its general understanding of the possible cell compartments.
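A common way to implement such a secondary task is a multi-task objective: the primary reconstruction loss plus a weighted auxiliary classification loss. The weighting, the loss choices, and the compartment list below are illustrative assumptions, not the paper's exact formulation:

```python
import math

# Hedged sketch of a multi-task training objective: primary inpainting
# (reconstruction) loss plus an auxiliary "name the compartment" loss.
COMPARTMENTS = ["nucleus", "cytoplasm", "mitochondria", "er"]  # illustrative list

def inpainting_loss(predicted, target):
    """Mean squared error between predicted and true localization images."""
    n = sum(len(row) for row in target)
    return sum((p - t) ** 2
               for prow, trow in zip(predicted, target)
               for p, t in zip(prow, trow)) / n

def compartment_loss(probs, true_label):
    """Cross-entropy for the auxiliary compartment-classification task."""
    return -math.log(probs[COMPARTMENTS.index(true_label)])

def total_loss(predicted, target, probs, true_label, aux_weight=0.5):
    """Combined objective: reconstruction plus a weighted auxiliary term."""
    return (inpainting_loss(predicted, target)
            + aux_weight * compartment_loss(probs, true_label))

pred = [[0.2, 0.8], [0.1, 0.9]]
true = [[0.0, 1.0], [0.0, 1.0]]
probs = [0.7, 0.1, 0.1, 0.1]  # model's predicted probability per compartment
loss = total_loss(pred, true, probs, "nucleus")
```

The auxiliary term only shapes the shared representations during training; at inference time the model still outputs the highlighted image alone.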
In addition, the fact that PUPS is trained on proteins and cell lines at the same time helps it develop a deeper understanding of where proteins tend to localize in a cell image.
PUPS can even understand, on its own, how different parts of a protein's sequence contribute separately to its overall localization.
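One generic way to probe how different parts of a sequence contribute to a prediction is occlusion: mask a segment, re-run the model, and measure how much the output shifts. This is an illustrative attribution technique, not necessarily how PUPS computes its per-region contributions, and the toy model below is a made-up stand-in:

```python
def occlusion_scores(sequence, predict, window=4, mask_char="X"):
    """Score each window of the sequence by how much masking it
    changes the model's (scalar) prediction."""
    baseline = predict(sequence)
    scores = []
    for start in range(0, len(sequence), window):
        masked = (sequence[:start]
                  + mask_char * min(window, len(sequence) - start)
                  + sequence[start + window:])
        scores.append(abs(predict(masked) - baseline))
    return scores

# Hypothetical stand-in model: scores a sequence by its lysine ("K") content,
# so only windows containing K should receive nonzero occlusion scores.
toy_predict = lambda s: s.count("K") / len(s)
scores = occlusion_scores("MKTAYIAKLLNN", toy_predict, window=4)
```

Applied to a real localization model, the same loop would reveal which sequence segments, such as signal peptides or targeting motifs, drive the predicted compartment.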
“Most other methods usually require you to have a stain of the protein first, so you’ve already seen it in your training data. Our approach is unique in that it can generalize across proteins and cell lines at the same time,” Zhang says.
Because PUPS can generalize to unseen proteins, it can capture changes in localization driven by unique protein mutations that aren't included in the Human Protein Atlas.
The researchers verified that PUPS could predict the subcellular location of new proteins in unseen cell lines by conducting laboratory experiments and comparing the results. In addition, when compared to a baseline AI method, PUPS exhibited lower average prediction error across the proteins they tested.
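A minimal sketch of how such an average-prediction-error comparison might be computed: mean absolute per-pixel error between each method's predicted localization map and the measured one. The numbers and the metric choice are purely illustrative assumptions; the paper's actual evaluation may differ:

```python
def mean_abs_error(predicted, measured):
    """Mean absolute per-pixel difference between two localization maps."""
    diffs = [abs(p - m)
             for prow, mrow in zip(predicted, measured)
             for p, m in zip(prow, mrow)]
    return sum(diffs) / len(diffs)

# Made-up 2x2 maps for illustration only: a measured localization image,
# one method's prediction, and a less informative baseline prediction.
measured = [[0.0, 1.0], [0.0, 1.0]]
method_pred = [[0.1, 0.9], [0.0, 0.8]]
baseline_pred = [[0.5, 0.5], [0.5, 0.5]]

method_err = mean_abs_error(method_pred, measured)
baseline_err = mean_abs_error(baseline_pred, measured)
```

Averaging such per-image errors over many held-out proteins is the standard way to turn per-pixel comparisons into a single headline number.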
In the future, the researchers want to extend PUPS so the model can understand protein-protein interactions and make localization predictions for multiple proteins within a cell. In the longer term, they want to enable PUPS to make predictions in terms of living human tissue, rather than cultured cells.
The research is funded by the Broad Institute, the National Institutes of Health, the National Science Foundation, the Burroughs Wellcome Fund, the Searle Scholars Foundation, the Harvard Stem Cell Institute, the Merkin Institute, the Office of Naval Research, and the Department of Energy.