Today, 99.999 percent of the estimated 1 trillion species on Earth are thought to be microbial: bacteria, archaea, viruses, and single-celled eukaryotes. For most of our planet’s history, microorganisms ruled the Earth, able to live and thrive in the most extreme environments. Researchers have only begun to grapple with the diversity of microbes in the past few decades; it’s estimated that less than 1 percent of known genes have laboratory-validated functions. Computational approaches offer researchers the opportunity to strategically parse this truly astonishing amount of information.
An environmental microbiologist and computer scientist by training, new MIT faculty member Yunha Hwang is interested in the novel biology revealed by the most diverse and abundant life form on Earth. In a shared faculty position as the Samuel A. Goldblith Career Development Professor in the Department of Biology, as well as an assistant professor in the Department of Electrical Engineering and Computer Science and the MIT Schwarzman College of Computing, Hwang is exploring the intersection of computation and biology.
Q: What attracted you to researching microbes in extreme environments, and what are the challenges of studying them?
A: Extreme environments are great places to find interesting biology. I wanted to be an astronaut growing up, and the closest thing to astronomy is investigating extreme environments on Earth. And the only things that can live in those extreme environments are microorganisms. During a sampling expedition I participated in off the coast of Mexico, we found a colorful microbial mat under about 2 kilometers of water that was thriving because the bacteria breathed sulfur rather than oxygen, but none of the microbes I was hoping to study would grow in the lab.
The biggest challenge in studying microbes is that most of them cannot be cultivated, meaning the only way to study their biology is through a method called metagenomics. My latest work is genomic language modeling. We are hoping to develop a computational system so that we can examine the organism as closely as possible “in silico” using only sequence data. A genomic language model is technically a large language model, except that the language is DNA as opposed to human language. It is trained in a similar manner, just on a biological language rather than English or French. If our aim is to learn the language of biology, we must take advantage of the diversity of microbial genomes. Even though we have a lot of data, and more samples keep becoming available, we have just scratched the surface of microbial diversity.
Q: Given how diverse microbes are and how little we understand about them, how can studying microbes in silico using genomic language modeling advance our understanding of microbial genomes?
A: A genome consists of several million letters. No human being can possibly look at it and understand its meaning. However, we can program a machine to break the data into useful pieces. This is how bioinformatics works with single genomes. But if you’re looking at a gram of soil, which may contain thousands of unique genomes, that’s a lot of data to work with, and a human and a computer are needed together to deal with that data.
During my PhD and master’s degrees, we were just discovering new genomes and new lineages that were very different from anything that had been described or grown in the laboratory. These were the things we used to call “microbial dark matter.” When there are a lot of uncharacterized things, that’s where machine learning can be really useful, because we’re just looking for patterns. But that’s not the end goal. What we hope to do is to map these patterns to the evolutionary relationships between every genome, every microbe, and every instance of life.
Previously, we have been thinking about proteins as standalone entities. That gives us a good amount of information, because proteins are related by homology, and so things that are evolutionarily related may have the same function.
What we know from microbiology is that proteins are encoded in the genome, and the context in which a protein is embedded (which regions come before and after it) is evolutionarily conserved, especially if there is any functional pairing. This makes complete sense, because when you have three proteins that need to be expressed together because they form a unit, you want them to be located right next to each other.
What I want to do is to incorporate more of that genomic context into the way we discover and annotate proteins and understand protein function, so that we can go beyond sequence or structural similarity and add contextual information to how we understand proteins and hypothesize about their functions.
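As an illustrative aside, the idea of genomic context can be sketched in a few lines of Python. This is a toy example and not Hwang’s method: it assumes each genome is already reduced to an ordered list of protein family labels, and the labels (`famA`, `famB`, and so on) and the helper names `context_window` and `conserved_neighbors` are hypothetical. It simply counts how many genomes place a given family next to a target family, which is one crude signal of conserved context.

```python
from collections import Counter


def context_window(genes, index, flank=2):
    """Return the genes up to `flank` positions before and after genes[index]."""
    start = max(0, index - flank)
    upstream = genes[start:index]
    downstream = genes[index + 1:index + 1 + flank]
    return upstream, downstream


def conserved_neighbors(genomes, target, flank=2):
    """Count, for each neighboring family, how many genomes place it near `target`."""
    counts = Counter()
    for genes in genomes:
        seen = set()  # count each neighbor family at most once per genome
        for i, family in enumerate(genes):
            if family == target:
                upstream, downstream = context_window(genes, i, flank)
                seen.update(upstream + downstream)
        seen.discard(target)
        counts.update(seen)
    return counts


# Toy data (hypothetical): gene order along one contig per genome.
genomes = [
    ["famA", "famB", "famC", "famD", "famE"],
    ["famX", "famB", "famC", "famD", "famY"],
    ["famD", "famC", "famB", "famZ"],  # same neighbors, opposite orientation
]

counts = conserved_neighbors(genomes, "famC", flank=1)
for family, n in counts.most_common():
    print(f"{family}: adjacent to famC in {n} of {len(genomes)} genomes")
```

In this toy data, `famB` and `famD` flank `famC` in every genome regardless of orientation, the kind of pattern that could prompt a hypothesis that the three proteins work as a unit. A genomic language model would learn such regularities from sequence data rather than from hand-written counting rules.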
Q: How can your research be applied to harness the functional potential of microbes?
A: Microorganisms are probably the world’s best chemists. Taking advantage of microbial metabolism and biochemistry will lead to more sustainable and more efficient methods for producing new materials, new therapeutics and new types of polymers.
But it’s not just about efficiency: microorganisms are doing chemistry we don’t even know to think about. Understanding how microbes work, and being able to understand their genomic structure and their functional potential, will be really important as we think about how our world and climate are changing. Most carbon sequestration and nutrient cycling is done by microorganisms; if we do not understand how a microbe is able to fix nitrogen or carbon, we will have difficulty modeling Earth’s nutrient flows.
On the more medical side, infectious diseases are a real and growing threat. It is really important to understand how microorganisms behave in diverse environments relative to the rest of our microbiome as we think about the future and combat microbial pathogens.