Presented by Marwin Segler at Microsoft Research Forum, February 2025

“We think, in the future, predictive synthesis will really help chemists to accelerate the discovery of new essential molecules.”
– Marwin Segler, Principal Research Manager, Microsoft Research AI for Science
Transcript: Lightning Talk
Chimera: Accurate synthesis prediction by ensembling models with diverse inductive biases
Marwin Segler, Principal Research Manager, Microsoft Research AI for Science
This talk addresses chemical synthesis in drug discovery with a learning-to-rank framework that integrates AI-based models, significantly boosting prediction accuracy and producing routes preferred by chemists.
Microsoft Research Forum, February 25, 2025
WILL GUYMAN, Group Product Manager, Healthcare AI Models: To design a new medication to defeat an illness, scientists need to predict which blend of molecules can be transformed into medicines, a tedious process that traditionally takes decades and can cost billions. Researchers at Microsoft Research and Novartis have been developing a novel approach to addressing a major bottleneck in this process called retrosynthesis, figuring out how to start from a target molecule and plan the chemical steps needed to make it. In practical terms, that means cutting down on trial-and-error experiments, speeding up how quickly researchers can create new molecules, and ultimately lowering the time and cost needed to develop new treatments. I’ll now hand it over to Marwin, who will explain how this technology works in detail and discuss its potential impact on future drug discovery.
MARWIN SEGLER: Hi, I’m Marwin, principal researcher at Microsoft Research AI for Science. And on behalf of the team, especially Chris and Guoqing, I’m going to tell you a story about life and death. Small organic molecules are central to human well-being. As agrochemicals, they help to feed the planet; as drugs, they keep us healthy and, hopefully, help prevent us from dying too early; and as materials, they help to improve the quality of our lives. To get access to small molecules, one needs to synthesize them in the lab via a synthesis route. And a synthesis route, you can think about it like a cooking recipe, where we start from the ingredients and then run several steps until we reach the final product. And to plan a synthesis, chemists often start with the target that they want to make and then work their way recursively backward to the starting materials.
However, synthesis can be super challenging, as reactions can fail, and in multi-step syntheses the errors compound. And this is one of the reasons why small-molecule drug discovery, for example, is so much slower and more expensive than protein design, where we have seen so many recent breakthroughs. AI models that could help chemists find better synthesis routes would have a profound impact on how small molecules are discovered and produced, with the potential to really accelerate the discovery of much-needed new functional organic molecules. So how can we address this major bottleneck? First, we need a synthesis prediction model. This model takes in a target molecule and predicts a list of feasible reverse chemical reactions.
And this is similar to learning how to predict the moves in a chess game. But in chess it’s relatively simple because we actually don’t really need a model, because we can just implement the rules of the game perfectly as a very simple program. But chemistry is much, much more complicated. So, we need to learn this model from actual experimental data.
We can thus think about this model that we’re learning as a chemical generative world model that predicts which reactions are feasible for a given molecule in a given situation. And once we have such a model, we can plug it into a search algorithm and recursively apply it to get a full multi-step synthesis route.
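To make the recursion concrete, here is a minimal, hypothetical sketch: a stand-in single-step model proposes reverse reactions for a target, and a depth-first search recurses until every leaf is a purchasable building block. All names, the toy lookup table, and the building-block set are illustrative, not Chimera’s actual interface.

```python
# Toy sketch of recursive retrosynthesis search. A single-step model
# proposes candidate reverse reactions; we recurse backward until every
# leaf is a purchasable building block.

BUILDING_BLOCKS = {"CCO", "CC(=O)O"}  # toy purchasable starting materials

def single_step_model(target):
    """Stand-in for a learned model: returns candidate reactant sets,
    best-scored first. Here, just a toy lookup table."""
    table = {"CC(=O)OCC": [["CC(=O)O", "CCO"]]}  # ester -> acid + alcohol
    return table.get(target, [])

def plan_route(target, depth=0, max_depth=5):
    """Depth-first backward search; returns a route as nested tuples,
    or None if no route is found within the depth limit."""
    if target in BUILDING_BLOCKS:
        return target  # leaf: purchasable as-is
    if depth >= max_depth:
        return None
    for reactants in single_step_model(target):
        subroutes = [plan_route(r, depth + 1, max_depth) for r in reactants]
        if all(s is not None for s in subroutes):
            return (target, subroutes)
    return None

route = plan_route("CC(=O)OCC")  # ethyl acetate
```

In a real planner, the search would be guided by the model’s scores (e.g. best-first or Monte Carlo tree search) rather than plain depth-first recursion.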
How can we model chemical reactions? We can represent molecules either as graph structures or using SMILES [Simplified Molecular Input Line Entry System]. SMILES is basically the token sequence representation of the graph and carries the same information. Now, given the target product that we want to make, we could use an auto-regressive model to generate the SMILES sequence of the reactants de novo.
So, from scratch, token by token. If you know about language models, it’s very similar to that. This is very appealing, because it leaves it to stochastic gradient descent to figure out this process end to end. However, it has a disadvantage: in chemical reactions, usually only a small part of the molecule changes, and with this de novo model we need to copy the whole molecule, including the unchanged parts.
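As a small illustration of what “token by token” means for SMILES, here is a sketch of a tokenizer built on a regex pattern of the kind commonly used for this purpose (simplified and not exhaustive): multi-character tokens such as Cl and Br, and bracket atoms, must be kept whole so an autoregressive decoder emits chemically meaningful units.

```python
import re

# Illustrative SMILES tokenization pattern (not exhaustive): bracket
# atoms and two-letter elements are matched before single characters,
# so "Cl" is one token rather than carbon followed by a stray "l".
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|Si|se|@@|[BCNOPSFIbcnops]|[0-9]|[=#\\/()+\-%.])"
)

def tokenize(smiles):
    return SMILES_TOKEN.findall(smiles)

tokens = tokenize("CC(=O)Oc1ccccc1C(=O)O")  # aspirin
```

A de novo model would then generate exactly such a token sequence for the reactants, one token at a time.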
So here, in this reaction, only the marked parts change. So what we could do instead is just predict the edits that we have to apply to the molecule. And these edits can be represented very well using simple rules, or so-called templates, which we can derive directly from the training data.
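To illustrate the idea of an edit, here is a toy sketch of a single retro-template. Real systems match templates as subgraphs (for example, SMARTS patterns in RDKit); here a plain string rewrite on SMILES stands in for the key point: only the edited part of the molecule is predicted, and everything else is carried over unchanged.

```python
# Toy retro-template: cut an ester C(=O)O linkage back into the acid
# and the alcohol. A real template engine would do subgraph matching on
# the molecular graph; this string version is purely illustrative.

def apply_retro_template(product_smiles):
    """Apply a single hard-coded ester-hydrolysis retro-template.
    Returns the reactant SMILES list, or None if the template
    does not match."""
    marker = "C(=O)O"
    i = product_smiles.find(marker)
    if i < 0:
        return None  # template does not apply
    acyl = product_smiles[: i + len(marker)]   # acid fragment, unchanged atoms copied
    alkoxy = product_smiles[i + len(marker):]  # alcohol fragment
    if not alkoxy:
        return None  # product is the acid itself; nothing to cut
    return [acyl, "O" + alkoxy]

reactants = apply_retro_template("CC(=O)OCC")  # ethyl acetate
```

The unchanged scaffold is simply copied, which is exactly the advantage of edit-based models over de novo generation.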
Now how do we implement these models? The de novo prediction model we can implement as a sequence-to-sequence model using modern transformers, with grouped multi-query attention and modern activations.
An edit-based model we can implement via a dual GNN that encodes both the product and the edit templates. We then perform classification over which is the most appropriate template in our database, or collection of templates, to apply to the molecule.
Now there is an additional complication, which is where to apply this template in the molecule, because there can be multiple matches. For this we have an additional localization model that gives us scores for where the template optimally matches in the molecule. And this localization model we can also train with stochastic gradient descent. Now we have the best of both worlds.
But how do we combine the outputs of these two models? Again, we need something which is learned. We came up with a new learning-to-rank strategy, where an additional model scores the outputs that the different models provide, and then provides a rescoring that we can use to rerank the outputs and build an extremely powerful ensemble. And by combining these two models of complementary inductive bias, you will see we get extremely exciting results.
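The combination step can be sketched as follows. Each base model emits candidates with its own raw scores, which are not directly comparable across models; a rescoring function maps every candidate onto one shared scale, and the pooled list is re-sorted by that score. The linear, rank-based rescorer below is purely illustrative; in the actual system the rescoring model is itself learned.

```python
# Hedged sketch of learning-to-rank ensembling: pool candidates from
# several models, rescore each onto a shared scale, and re-sort.

def rerank(candidates_per_model, rescore):
    """candidates_per_model: {model_name: [(candidate, raw_score), ...]}
    rescore: (model_name, candidate, raw_score, rank) -> shared score.
    A candidate proposed by several models keeps its best score."""
    pooled = {}
    for model, ranked in candidates_per_model.items():
        for rank, (cand, raw) in enumerate(ranked):
            s = rescore(model, cand, raw, rank)
            pooled[cand] = max(pooled.get(cand, float("-inf")), s)
    return sorted(pooled, key=pooled.get, reverse=True)

# Toy rescorer: trust rank position only, with a per-model weight
# (in practice this function would be learned from data).
weights = {"de_novo": 1.0, "edit_based": 1.2}
def rescore(model, cand, raw, rank):
    return weights[model] / (rank + 1)

merged = rerank(
    {"de_novo": [("A", 0.9), ("B", 0.5)],
     "edit_based": [("B", 3.1), ("C", 2.0)]},
    rescore,
)
```

Candidate B, proposed by both models, ends up ranked first, which is the intended behavior: agreement between models with different inductive biases is strong evidence.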
But first, we need to make sure that we’re really able to check what’s going on. The issue with chemical data is that it often has temporal bias, so it’s somewhat clustered over time. If we randomly split the data, we get this weird time-machine effect. So what we did instead was to make a clean time-split of the data: we train our models only on reaction data from patents published up to 2023, and then test the models on data published from 2024 onwards.
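The time-split itself is simple; a sketch, with illustrative field names:

```python
# Sketch of the time-split described above: train on reactions from
# patents published before the cutoff, test on the rest, avoiding the
# "time machine" leakage a random split would introduce.

def time_split(reactions, cutoff_year=2024):
    train = [r for r in reactions if r["year"] < cutoff_year]
    test = [r for r in reactions if r["year"] >= cutoff_year]
    return train, test

data = [{"id": 1, "year": 2021}, {"id": 2, "year": 2023},
        {"id": 3, "year": 2024}]
train, test = time_split(data)
```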
And then as a metric, we asked the model to make 50 predictions for each test-set product, and measured how often the model was able to recover the ground-truth reactants. Now, we can measure how the models are doing in different regimes, and what we see with all the baseline models that have been published is that they tend to work super well when there’s a lot of data, as in the typical deep learning regime. However, reactions for which we don’t have many examples in the training data are very often super important for synthesis strategy. You can see this here, where we show the performance of the models broken down by how frequently different reaction classes occur in the training data. So far, this has been a major limitation of deep learning models in this domain.
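The metric can be sketched as a simple top-k recovery rate. In practice, SMILES would need to be canonicalized before exact matching; that detail, and the toy model below, are illustrative.

```python
# Sketch of the top-50 recovery metric: for each test product, ask the
# model for up to k predictions and count a hit if the ground-truth
# reactant set appears anywhere in that list.

def top_k_recovery(predict, test_set, k=50):
    """Fraction of test products whose ground-truth reactant set
    appears among the model's top-k predictions."""
    hits = 0
    for product, true_reactants in test_set:
        if true_reactants in predict(product)[:k]:
            hits += 1
    return hits / len(test_set)

# Toy model that always proposes the same two reactant sets.
toy_predict = lambda product: [["CCO"], ["CC(=O)O"]]
score = top_k_recovery(
    toy_predict,
    [("P1", ["CC(=O)O"]), ("P2", ["CCBr"])],
    k=50,
)
```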
Thankfully, with our model, we can not just outperform all the baselines typically used in the literature on the very frequent cases, but we can also really make progress on the classes where we don’t have that many data points in our training sets. We can even maintain very high performance in cases where we have just two examples in the training set, which is usually super rare to achieve with deep learning models.
And even if we just have one example in the training data, or even in the zero-shot case, we can still achieve reasonable performance, whereas the baselines drop off almost completely. And that’s super important for synthesis strategy.
Now, another question is how robust the model is when we move further away from the training data. This is very important in discovery because, by definition, we need to make predictions on new molecules that have never been made before. We can measure that by chemical similarity: how far the molecules in the test set are from the training data. Existing baselines drop off quite a bit. But with our new ensemble, we achieve a step change: we can maintain high performance even when we move very far away from the training data, giving us a sense of the out-of-distribution prediction capabilities of our model.
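Distance from the training set is typically measured with Tanimoto similarity over molecular fingerprints, |A ∩ B| / |A ∪ B| on fingerprint bit sets. A real pipeline would use, for example, Morgan fingerprints from RDKit; here plain Python sets of hashed substructure features stand in.

```python
# Sketch of measuring distance from the training set via Tanimoto
# similarity on fingerprint bit sets. Fingerprints are represented as
# sets of feature IDs for illustration.

def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity of two fingerprint bit sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def nearest_train_similarity(test_fp, train_fps):
    """Max similarity of a test molecule to any training molecule;
    low values mean the prediction is far out of distribution."""
    return max(tanimoto(test_fp, fp) for fp in train_fps)
```

Bucketing the test set by this nearest-neighbor similarity is what produces the robustness curves described above.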
And why is this important? Again, in drug discovery, one needs to make new molecules that have never been made before: structurally very new, very different. With these improvements, we can now apply synthesis prediction with much more confidence to new molecules. And to give you an example of how that would look in practice, here’s a synthesis route predicted by our model for a molecule you could typically expect in a drug discovery project, which is non-trivial, so it’s quite a long sequence of steps.
And, just to give you an example of rare reaction classes, the model is able to predict this specific Hemetsberger-Knittel indole synthesis step, which as a chemist you might not immediately think of. But the model is able to retrieve it and propose it, and in this context, it actually makes sense. This is one example of how rare reaction classes can be highly strategic.
And we think, in the future, predictive synthesis will really help chemists to accelerate the discovery of new essential molecules. And if that excites you, check out the extensive results in our paper, including validation with our great collaborators at Novartis. And up next, you’re going to hear from Jianwei Yang from the Microsoft Research AI Frontiers team to introduce Magma.
Thank you for listening.