Benchmarking chemical exploration in de novo drug design with MolExp
TL;DR: Molecular Exploration (MolExp) is a new benchmark integrated into MolScore. It leverages the converse of the similarity principle to probe a generative model's ability to explore chemical space, and it is considerably more difficult than current benchmarks. Hopefully, it helps guide algorithm development towards a future where, for a given context, all molecules of interest in chemical space can be found quickly. Preprint and code links below.
Background
Generative molecular design has moved beyond the theoretical, with many workflows already experimentally validated (Du et al., 2024). I want to take a small step back and re-evaluate the common objective of goal-directed generative algorithms in the context of drug design...
Goal-directed drug design: Automatically 'build' molecules so as to maximise an oracle score / fitness function / scoring function / desirability criteria, usually a combination of predicted molecular properties. We generally task our models with finding a single optimal molecule, $x^* = \arg\max_{x \in \mathcal{X}} f(x)$, where $f$ is the oracle and $\mathcal{X}$ is chemical space.
Small molecule drug design: Theoretically, the goal of drug design - to find the optimal drug candidate that is efficacious, bio-available, safe (and non-patented) - is the same, i.e., find 'the needle in the haystack'. The reality is somewhat different: our oracles are inaccurate, multi-parameter optimisation is a Pareto problem that admits different property profiles, and chemical space is enormous. So I think it's more realistic to say that there are many 'needles in the haystack' and we want to, for example, identify all molecules with scores above a threshold $T$.
So goal-directed drug design should instead align with identifying all molecules that maximise the oracle, i.e., recovering the set $\{x \in \mathcal{X} : f(x) \geq T\}$.
Current benchmarks: Unfortunately, I don't think current benchmarks account for this particular endpoint. Most GuacaMol (Brown et al., 2019) and MolOpt (Gao et al., 2022) tasks focus on either 1) target rediscovery, 2) target similarity, or 3) predicted properties. Each of these can be satisfied by a single optimal molecule (or close analogues of it), so none intrinsically requires a model to keep exploring once a high-scoring region is found.

Post-hoc evaluation metrics that account for the top-k molecules, scaffolds, or even top-k diverse molecules can help (Thomas et al., 2022; Renz et al., 2024), but they do not measure a model's explorative capability well if the task itself does not require much exploration.
Molecule Exploration Benchmark
The Molecule Exploration (MolExp) benchmark requires intrinsic exploration by tasking models with multiple target rediscovery, where the oracle is the maximum similarity of a de novo molecule $x$ to a set of target molecules $\{t_1, \dots, t_n\}$, i.e., $f(x) = \max_i \mathrm{sim}(x, t_i)$.
This benchmark takes inspiration from the converse of the similarity principle, whereby dissimilar molecules nonetheless possess similar bioactivity (a property also leveraged to benchmark predictive models using activity cliffs (Van Tilborg et al., 2022)). It consists of four tasks, each with 2-4 contextually diverse molecular targets that share similar bioactivity, such as anti-psychotic drugs (AP) and Adenosine A2A receptor binders.
How do current generative models perform?
Using chemical language models (CLMs) pre-trained on ChEMBL 34 (a training dataset that includes all of the target molecules), here's how different reinforcement learning (RL) algorithms perform when trained with a budget of 10,000 molecules. In this case, REINVENT does not outperform virtual screening (which randomly samples 10,000 molecules from the training dataset). The best algorithm is ACEGEN.
If we look at similarity to each target during training (each curve in the plot below), we can see that even though ACEGEN optimises the oracle well, it tends to converge on a single target at a time rather than rediscovering all of them.
Long story short, this benchmark can be solved, but it requires 128 independent ACEGEN runs, i.e., scaling up the compute spent on test-time training.
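As a rough illustration of this test-time scaling strategy, here is a minimal sketch of launching independent runs with different seeds and pooling the best similarity per target. Note that train_rl_run is a hypothetical stand-in for a full RL training run, not a function from ACEGEN or MolScore.

import random
from concurrent.futures import ProcessPoolExecutor

def train_rl_run(seed: int) -> dict:
    # Hypothetical stand-in for one independent RL training run: in reality
    # this would train a CLM with RL and return the best similarity achieved
    # against each target molecule.
    random.seed(seed)
    return {"target_1": random.random(), "target_2": random.random()}

if __name__ == "__main__":
    best = {}
    with ProcessPoolExecutor() as pool:
        # 128 independent runs, each with its own seed
        for result in pool.map(train_rl_run, range(128)):
            for target, sim in result.items():
                best[target] = max(best.get(target, 0.0), sim)
    print(best)  # Best similarity per target, pooled across runs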
I hope this benchmark can grow in the number and difficulty of tasks proposed by the community, to guide the development of generative algorithms. Eventually, this could realise a future where we quickly identify all drug-like molecules within enormous chemical spaces, just as we do with much smaller explicit chemical spaces via virtual screening.
Running MolExp
The MolExp benchmark is available in MolScore (Thomas et al., 2024) and can be installed via PyPI.
$ pip install "MolScore>=1.9"
Two versions of the benchmark are available:
- MolExp: This computes the similarity to the target molecules using the Tanimoto similarity of the ECFP4 fingerprints with bit counts.
- MolExpL: This computes the similarity to the target molecules using a Levenshtein (string-edit) similarity measure between SMILES (both canonicalised). Note this is the version used in the results shown above and in the preprint, because CLMs operate in string space. Both measures are sketched below.
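To make these concrete, here's a minimal sketch of how both measures, and the max-over-targets oracle on top of them, could be computed with RDKit and the standard library. This is an illustrative approximation under my own assumptions; the exact fingerprint settings and normalisation in MolScore may differ.

from rdkit import Chem, DataStructs
from rdkit.Chem import rdFingerprintGenerator

# Count-based ECFP4 (Morgan fingerprint, radius 2) generator
_fpgen = rdFingerprintGenerator.GetMorganGenerator(radius=2)

def ecfp4_tanimoto(smiles_a: str, smiles_b: str) -> float:
    # Tanimoto similarity between count-based ECFP4 fingerprints
    fp_a = _fpgen.GetCountFingerprint(Chem.MolFromSmiles(smiles_a))
    fp_b = _fpgen.GetCountFingerprint(Chem.MolFromSmiles(smiles_b))
    return DataStructs.TanimotoSimilarity(fp_a, fp_b)

def levenshtein_similarity(smiles_a: str, smiles_b: str) -> float:
    # Normalised string-edit similarity between canonical SMILES
    a = Chem.MolToSmiles(Chem.MolFromSmiles(smiles_a))
    b = Chem.MolToSmiles(Chem.MolFromSmiles(smiles_b))
    # Classic dynamic-programming Levenshtein distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return 1.0 - prev[-1] / (max(len(a), len(b)) or 1)

def molexp_oracle(smiles: str, targets: list, sim=ecfp4_tanimoto) -> float:
    # MolExp-style oracle: maximum similarity to any target molecule
    return max(sim(smiles, t) for t in targets)

For instance, molexp_oracle(smiles, targets, sim=levenshtein_similarity) would reproduce a MolExpL-style score for a single molecule.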
from molscore import MolScoreBenchmark

msb = MolScoreBenchmark(
    model_name='GM',
    output_dir='./results',
    benchmark='MolExpL',  # Or 'MolExp'
    budget=10_000
)
for task in msb:
    while not task.finished:
        SMILES = GM.sample()         # Sample molecules from your generative model
        scores = task.score(SMILES)  # Pass them to MolScore as SMILES to score
        # Use the scores to compute the loss and update your generative model
That's it! Benchmark results will be saved to 'results.csv' in the directory created. Note that five replicates should be run with different seeds, but that's on you.
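For what it's worth, here is a minimal sketch of running those five replicates; how you seed your generative model (the hypothetical GM.set_seed below) is up to you and not part of the MolScore API.

from molscore import MolScoreBenchmark

for seed in range(5):
    GM.set_seed(seed)  # Hypothetical: seed your generative model however it supports
    msb = MolScoreBenchmark(
        model_name=f'GM_seed{seed}',  # Separate results per replicate
        output_dir='./results',
        benchmark='MolExpL',
        budget=10_000
    )
    for task in msb:
        while not task.finished:
            scores = task.score(GM.sample())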
Links
- Benchmark preprint: "Test-Time Training Scaling for Chemical Exploration in Drug Design"
- ACEGEN preprint: "REINFORCE-ING Chemical Language Models in Drug Design"
- MolScore: Code to run the MolExp benchmark on a generative model of your choice.
- ACEGEN: Code used to run a CLM with different RL configurations on the MolExp benchmark.
- Cheminformantics code: Results and the Jupyter notebook generating the figures shown in this blog.
References
- Brown, Nathan, et al. "GuacaMol: benchmarking models for de novo molecular design." Journal of Chemical Information and Modeling 59.3 (2019): 1096-1108.
- Du, Yuanqi, et al. "Machine learning-aided generative molecular design." Nature Machine Intelligence (2024): 1-16.
- Gao, Wenhao, et al. "Sample efficiency matters: a benchmark for practical molecular optimization." Advances in Neural Information Processing Systems 35 (2022): 21342-21357.
- Renz, Philipp, Sohvi Luukkonen, and Günter Klambauer. "Diverse Hits in De Novo Molecule Design: Diversity-Based Comparison of Goal-Directed Generators." Journal of Chemical Information and Modeling 64.15 (2024): 5756-5761.
- Thomas, Morgan, Albert Bou, and Gianni De Fabritiis. "REINFORCE-ING Chemical Language Models in Drug Design." arXiv preprint arXiv:2501.15971 (2025).
- Thomas, Morgan, et al. "MolScore: a scoring, evaluation and benchmarking framework for generative models in de novo drug design." Journal of Cheminformatics 16.1 (2024): 64.
- Thomas, Morgan, et al. "Re-evaluating sample efficiency in de novo molecule generation." arXiv preprint arXiv:2212.01385 (2022).
- Van Tilborg, Derek, Alisa Alenicheva, and Francesca Grisoni. "Exposing the limitations of molecular machine learning with activity cliffs." Journal of Chemical Information and Modeling 62.23 (2022): 5938-5951.