Studying gene expression in cells from cancer patients can help clinical biologists understand the origins of cancer and predict how well various treatments will work. But cells are complex, many-layered systems, so how biologists take measurements affects what data they can get. For example, measuring the proteins in a cell may provide different information about the effects of cancer than measuring gene expression or cell morphology.
But where does the information about what is happening inside the cell actually come from? To get a complete picture of a cell’s state, scientists often must take multiple measurements using different techniques and analyze them one at a time. Machine-learning methods can speed up the process, but existing methods blend all the information from each measurement together, making it difficult to trace which data came from which measurement.
To overcome this problem, researchers at the Broad Institute of MIT and Harvard and the ETH Zurich/Paul Scherrer Institute (PSI) have developed an artificial intelligence-powered framework that learns which information about cell state is shared across different measurement modalities and which information is unique to a particular measurement type.
By keeping track of which information came from which measurement, the approach provides a more holistic view of the cell’s state, making it easier for biologists to see the full picture of cellular interactions. This could help scientists understand disease mechanisms and track the progression of cancer, neurodegenerative disorders like Alzheimer’s, and metabolic diseases like diabetes.
“When we study cells, one measurement is often not enough, so scientists develop new techniques to measure different aspects of cells. While we have many ways to look at a cell, at the end of the day we only have one underlying cell state. By putting information from all these measurement modalities together in a better way, we can get a complete picture of the cell’s state,” says lead author Xinyi Zhang SM ’22, PhD ’25, a former graduate student in MIT’s Department of Electrical Engineering and Computer Science (EECS) and an affiliate of the Eric and Wendy Schmidt Center at the Broad Institute of MIT and Harvard, who is now a group leader at Athera in Vienna, Austria.
Zhang is joined on the paper by G.V. Shivashankar, professor in the Department of Health Sciences and Technology at ETH Zurich and head of the Multiscale Bioimaging Laboratory at PSI; and senior author Caroline Uhler, professor in EECS and the Institute for Data, Systems, and Society (IDSS) at MIT, a member of MIT’s Laboratory for Information and Decision Systems (LIDS), and director of the Eric and Wendy Schmidt Center at the Broad Institute. The research appears today in Nature Computational Science.
Managing multiple measurements
There are many tools scientists can use to obtain information about the state of a cell. For example, they can measure RNA to see whether the cell is growing, or they can measure chromatin morphology to see how the cell is responding to external physical or chemical signals.
“When scientists perform multimodal analyses, they gather information using multiple measurement modalities and integrate it to better understand the underlying state of the cell. Some information is captured by only one modality, while other information is shared across all modalities. To fully understand what is happening inside the cell, it is important to know where the information came from,” says Shivashankar.
Often, the only way for scientists to sort this out is to conduct several separate experiments and compare the results. This slow, cumbersome process limits the amount of information they can collect.
In the new work, the researchers created a machine-learning framework that explicitly learns which information overlaps between different modalities and which information is unique to a particular modality and not captured by the others.
“As a user, you can simply input your cell data and it automatically tells you which data is shared and which data is modality-specific,” says Zhang.
To build this framework, the researchers reconsidered the typical way to design machine-learning models to capture and interpret multimodal cellular measurements.
Typically, these methods, known as autoencoders, use one model for each measurement technique, and each model encodes its own representation of the data captured by that technique. The representation is a compressed version of the input data that strips away irrelevant details.
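For readers who want to see the shape of this standard setup, here is a minimal sketch in PyTorch. The class name, layer sizes, and input dimensions are illustrative stand-ins, not the authors’ implementation.

```python
import torch
import torch.nn as nn

class ModalityAutoencoder(nn.Module):
    """One autoencoder per measurement modality (illustrative sketch)."""

    def __init__(self, input_dim: int, latent_dim: int):
        super().__init__()
        # Encoder: compress the raw measurement into a low-dimensional code.
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )
        # Decoder: reconstruct the measurement from the code, forcing the
        # code to keep relevant details and drop the rest.
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim),
        )

    def forward(self, x: torch.Tensor):
        z = self.encoder(x)
        return z, self.decoder(z)

# One independent model per modality (dimensions are made up for the sketch);
# trained separately, their codes are not directly comparable across modalities.
rna_ae = ModalityAutoencoder(input_dim=2000, latent_dim=32)
chromatin_ae = ModalityAutoencoder(input_dim=500, latent_dim=32)
```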
The MIT method consists of a shared representation space where data that overlaps between multiple modalities is encoded, as well as separate spaces where unique data from each modality is encoded.
In essence, one can think of it like a Venn diagram of cellular data.
The researchers also used a special two-step training process that helps the model handle the complexity of deciding which information should be shared across modalities. After training, the model can identify which information is shared and which is unique when it is fed cell data it has never seen before.
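To make the Venn-diagram picture concrete, one possible way to wire up such a model is sketched below, again in PyTorch. The partitioned latent space, the alignment loss, and the two-stage schedule are assumptions chosen for illustration; the paper’s actual architecture, losses, and training procedure may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DisentangledMultimodalAE(nn.Module):
    """Each modality's encoder emits a shared code plus a private
    (modality-specific) code; each decoder reconstructs its modality
    from the concatenation of the two. Illustrative sketch only."""

    def __init__(self, dims: dict, shared_dim: int = 16, private_dim: int = 16):
        super().__init__()
        self.shared_dim, self.private_dim = shared_dim, private_dim
        self.encoders = nn.ModuleDict({
            m: nn.Sequential(nn.Linear(d, 256), nn.ReLU(),
                             nn.Linear(256, shared_dim + private_dim))
            for m, d in dims.items()})
        self.decoders = nn.ModuleDict({
            m: nn.Sequential(nn.Linear(shared_dim + private_dim, 256), nn.ReLU(),
                             nn.Linear(256, d))
            for m, d in dims.items()})

    def encode(self, modality: str, x: torch.Tensor):
        z = self.encoders[modality](x)
        return z[:, :self.shared_dim], z[:, self.shared_dim:]  # shared, private

    def loss(self, batch: dict, align_weight: float) -> torch.Tensor:
        recon, shared_codes = 0.0, []
        for m, x in batch.items():
            s, p = self.encode(m, x)
            x_hat = self.decoders[m](torch.cat([s, p], dim=1))
            recon = recon + F.mse_loss(x_hat, x)
            shared_codes.append(s)
        # Paired measurements come from the same cells, so their shared
        # codes should agree; the private codes are left free to differ.
        align = F.mse_loss(shared_codes[0], shared_codes[1])
        return recon + align_weight * align

# A two-stage schedule standing in for the paper's two-step procedure:
# first learn faithful reconstructions, then turn on the alignment term
# that pushes genuinely shared information into the shared space.
model = DisentangledMultimodalAE({"rna": 2000, "chromatin": 500})
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for w in (0.0, 1.0):  # stage 1: reconstruction only; stage 2: add alignment
    for _ in range(100):  # toy iteration count with random stand-in data
        batch = {"rna": torch.randn(64, 2000), "chromatin": torch.randn(64, 500)}
        opt.zero_grad()
        model.loss(batch, align_weight=w).backward()
        opt.step()
```

The design choice to split each encoder’s output into a shared slice and a private slice is what would let a user later ask which slice a given piece of information landed in.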
Disentangling the data
In tests on synthetic datasets, the framework correctly recovered known shared and modality-specific information. When the researchers applied the method to real-world single-cell datasets, it automatically distinguished gene activity that was jointly captured by two measurement modalities, such as transcriptomics and chromatin accessibility, while also correctly identifying which information came from only one of those modalities.
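The value of synthetic tests is that the ground-truth split is known by construction: one can generate data in which a shared factor drives both modalities while each modality also has its own private factor, then check that the model recovers that split. A hedged sketch of such a generator (illustrative, not the paper’s benchmark):

```python
import torch

def make_synthetic_pair(n_cells: int = 1024, shared_k: int = 4,
                        private_k: int = 4, dim: int = 100):
    """Two toy 'modalities' driven by one shared factor plus one private
    factor each, so the true shared/specific split is known by construction."""
    z_shared = torch.randn(n_cells, shared_k)   # cell state visible to both
    z_a = torch.randn(n_cells, private_k)       # visible only to modality A
    z_b = torch.randn(n_cells, private_k)       # visible only to modality B
    # Random linear maps stand in for the two measurement processes.
    x_a = z_shared @ torch.randn(shared_k, dim) + z_a @ torch.randn(private_k, dim)
    x_b = z_shared @ torch.randn(shared_k, dim) + z_b @ torch.randn(private_k, dim)
    return (x_a, x_b), (z_shared, z_a, z_b)
```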
Additionally, the researchers used their method to identify which measurement modality captured a protein marker that indicates DNA damage in cancer patients. Knowing where this information comes from can help clinical scientists determine which technique they should use to measure that marker.
“There are a lot of modalities in a cell and we can’t possibly measure them all, so we need a prediction tool. But then the question is: Which modalities should we measure and which modalities should we predict? Our method can answer that question,” says Uhler.
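Concretely, once the shared information has been isolated, one modality can be predicted from another through the shared space: only the shared portion of an unmeasured modality is predictable at all. Continuing the sketch above (the zero-filled private code is one simple, assumed choice, not the paper’s method):

```python
import torch

@torch.no_grad()
def predict_b_from_a(model: "DisentangledMultimodalAE", x_a: torch.Tensor,
                     m_a: str = "rna", m_b: str = "chromatin") -> torch.Tensor:
    """Decode modality B from modality A's shared code. B's private code is
    unknown by definition, so it is filled with zeros in this sketch."""
    s, _ = model.encode(m_a, x_a)
    p_b = torch.zeros(x_a.shape[0], model.private_dim)
    return model.decoders[m_b](torch.cat([s, p_b], dim=1))

# Example: predict chromatin measurements for 8 cells from their RNA profiles.
predicted_chromatin = predict_b_from_a(model, torch.randn(8, 2000))
```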
In the future, the researchers want to enable the model to provide more explanatory information about the state of the cell. They also want to conduct additional experiments to ensure it can accurately resolve cellular information and apply the model to a broader range of clinical questions.
“Integrating information from all these modalities is not enough,” says Uhler. “If we carefully compare different modalities to understand how different components of cells regulate each other, we can learn a lot about the state of the cell.”
This research is funded, in part, by the Eric and Wendy Schmidt Center at the Broad Institute, the Swiss National Science Foundation, the US National Institutes of Health, the US Office of Naval Research, AstraZeneca, the MIT-IBM Watson AI Lab, the MIT Jameel Clinic for Machine Learning in Health, and a Simons Investigator Award.