Using machine learning, MIT chemical engineers have created a computational model that can predict how well any given molecule will dissolve in an organic solvent – a key step in the synthesis of nearly any pharmaceutical. This type of prediction could make it much easier to develop new ways to produce drugs and other useful molecules.
The new model, which predicts how much of a solute will dissolve in a particular solvent, should help chemists choose the right solvent for any given reaction in their synthesis, the researchers say. Common organic solvents include ethanol and acetone, and there are hundreds of others that can also be used in chemical reactions.
“Predicting solubility really is a rate-limiting step in synthetic planning and manufacturing of chemicals, especially drugs, so there’s been a long-standing interest in being able to make better predictions of solubility,” says Lucas Attia, an MIT graduate student and one of the lead authors of the new study.
The researchers have made their model freely available, and many companies and labs have already started using it. The model could be particularly useful for identifying solvents that are less hazardous than some of the most commonly used industrial solvents, the researchers say.
“There are some solvents which are known to dissolve most things. They’re really useful, but they’re damaging to the environment, and they’re damaging to people, so many companies require that you minimize the amount of those solvents that you use,” says Jackson Burns, an MIT graduate student who is also a lead author of the paper. “Our model is extremely useful in being able to identify the next-best solvent, which is hopefully much less damaging to the environment.”
William Green, the Hoyt Hottel Professor of Chemical Engineering and director of the MIT Energy Initiative, is the senior author of the study, which appears today in Nature Communications. Patrick Doyle, the Robert T. Haslam Professor of Chemical Engineering, is also an author of the paper.
Solving solubility
The new model grew out of a project that Attia and Burns worked on together in an MIT course on applying machine learning to chemical engineering problems. Traditionally, chemists have predicted solubility with a tool known as the Abraham Solvation Model, which can be used to estimate a molecule’s overall solubility by adding up the contributions of chemical structures within the molecule. While these predictions are useful, their accuracy is limited.
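For context, the Abraham approach is usually written as a linear free-energy relationship. A common textbook form, given here as a sketch and not taken from the study itself, is:

```latex
\log_{10} SP = c + eE + sS + aA + bB + vV
```

Here SP is the solute property being estimated (for example, a partition coefficient or a solubility ratio), the uppercase terms E, S, A, B and V are solute descriptors (excess molar refraction, dipolarity/polarizability, hydrogen-bond acidity, hydrogen-bond basicity and McGowan volume), and the lowercase coefficients are fitted for each solvent. The exact variant used in any particular tool may differ.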
In the past few years, researchers have begun using machine learning to try to make more accurate solubility predictions. Before Burns and Attia began working on their new model, the state-of-the-art model for predicting solubility was one developed in Green’s lab in 2022.
That model, known as SolProp, works by predicting a set of related properties and combining them, using thermodynamics, to ultimately predict the solubility. However, the model has difficulty predicting solubility for solutes that it has not seen before.
“For drug and chemical discovery pipelines where you’re developing a new molecule, you want to be able to predict ahead of time what its solubility looks like,” Attia says.
Part of the reason that existing solubility models have not worked well is that there was no comprehensive dataset to train them on. However, in 2023 a new dataset called BigSolDB was released, which compiled data from nearly 800 published papers, including solubility information for about 800 molecules dissolved in more than 100 organic solvents that are commonly used in synthetic chemistry.
Attia and Burns decided to try training two different types of models on this data. Both of these models represent the chemical structures of molecules using numerical representations known as embeddings, which incorporate information such as the number of atoms in a molecule and which atoms are bound to which other atoms. Models can then use these representations to predict a variety of chemical properties.
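As a rough illustration of what such a numerical representation can look like, the toy sketch below turns a molecule into a short vector of atom counts and bonding information. It is not the featurization used in the study; the function name `toy_embedding`, the chosen features and the ethanol example are assumptions made purely for illustration.

```python
from collections import Counter

# Hypothetical toy molecule: ethanol (CH3-CH2-OH), heavy atoms only.
atoms = ["C", "C", "O"]
bonds = [(0, 1), (1, 2)]  # pairs of atom indices that are bonded


def toy_embedding(atoms, bonds, elements=("C", "N", "O")):
    """Return a fixed-length vector: atom count per element,
    total bond count, and the largest number of neighbors on any atom."""
    counts = Counter(atoms)
    degree = [0] * len(atoms)
    for i, j in bonds:
        degree[i] += 1
        degree[j] += 1
    return [counts[e] for e in elements] + [len(bonds), max(degree, default=0)]


embedding = toy_embedding(atoms, bonds)
print(embedding)  # [2, 0, 1, 2, 2]
```

Real embeddings are far richer, but the principle is the same: the molecule's structure is reduced to numbers that a model can take as input.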
One of the models used in this study, known as fastprop and developed by Burns and others in Green’s lab, incorporates “static embeddings.” This means that the model already knows the embedding for each molecule before it starts doing any kind of analysis.
The other model, Chemprop, learns an embedding for each molecule during training, at the same time that it learns to associate the features of the embedding with a trait such as solubility. This model, developed across multiple MIT labs, has already been used for tasks such as antibiotic discovery, lipid nanoparticle design, and predicting chemical reaction rates.
The researchers trained both types of models on more than 40,000 data points from BigSolDB, including information on the effects of temperature, which plays a significant role in solubility. Then, they tested the models on about 1,000 solutes that had been withheld from the training data. They found that the models’ predictions were two to three times more accurate than those of SolProp, the previous best model, and that the new models were especially accurate at predicting variations in solubility due to temperature.
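Withholding whole solutes, and not just individual measurements, is what makes this a test on molecules the model has never seen. A minimal sketch of such a split follows; the helper `split_by_solute`, the record fields and the placeholder values are hypothetical and are not drawn from the study's code or from BigSolDB.

```python
import random


def split_by_solute(records, test_fraction=0.2, seed=0):
    """Split records so each solute lands entirely in train or entirely in test."""
    solutes = sorted({r["solute"] for r in records})
    rng = random.Random(seed)
    rng.shuffle(solutes)
    n_test = max(1, round(len(solutes) * test_fraction))
    test_solutes = set(solutes[:n_test])
    train = [r for r in records if r["solute"] not in test_solutes]
    test = [r for r in records if r["solute"] in test_solutes]
    return train, test


# Hypothetical records: solute, solvent, temperature (K), placeholder log-solubility.
records = [
    {"solute": s, "solvent": v, "temp_k": t, "log_s": 0.0}
    for s in ["A", "B", "C", "D", "E"]
    for v in ["ethanol", "acetone"]
    for t in [298.15, 313.15]
]

train, test = split_by_solute(records)
print(len(train), len(test))
```

Because every measurement of a held-out solute goes to the test set, a good test score cannot come from the model having memorized that solute in another solvent or at another temperature.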
“Being able to accurately reproduce those small variations in solubility due to temperature, even when the overarching experimental noise is very large, was a really positive sign that the network had correctly learned an underlying solubility prediction function,” Burns says.
Accurate predictions
The researchers had expected that the model based on Chemprop, which is able to learn new representations as it goes along, would make more accurate predictions. However, to their surprise, they found that the two models performed essentially the same. This suggests that the main limitation on their performance is the quality of the data, and that the models are performing as well as theoretically possible given the data they are using, the researchers say.
“Chemprop should always outperform any static embedding when you have sufficient data,” Burns says. “We were blown away to see that the static and learned embeddings were statistically indistinguishable in performance across all the different subsets, which indicates to us that the data limitations present in this space dominated the model performance.”
The models could become more accurate, the researchers say, if better training and testing data were available – ideally, data obtained by one person or a group of people all trained to perform the experiments the same way.
“One of the big limitations of using these kinds of compiled datasets is that different labs use different methods and experimental conditions when they perform solubility tests. That contributes to this variability between different datasets,” Attia says.
Because the model based on fastprop makes its predictions faster and has code that is easier for other users to adapt, the researchers decided to make that one, known as FastSolv, available to the public. Multiple pharmaceutical companies have already begun using it.
“There are applications throughout the drug discovery pipeline,” Burns says. “We’re also excited to see, outside of formulation and drug discovery, where people may use this model.”
The research was funded by the U.S. Department of Energy.