Benchmarking chemical exploration in de novo drug design with MolExp

TL;DR: Molecular Exploration (MolExp) is a new benchmark integrated into MolScore. It leverages the converse of the similarity principle to probe a generative model's ability to explore chemical space, and it is much more difficult than current benchmarks. Hopefully, this helps to guide algorithm development towards a future where all relevant molecules in chemical space can be found quickly, for a given context. Preprint and code links below.

Background

Generative molecular design has already moved beyond the theoretical, with many workflows now experimentally validated (Du et al., 2024). Here, I want to take a small step back and re-evaluate the common objective of goal-directed generative algorithms in the context of drug design...

Goal-directed drug design: Automatically 'build' molecules so as to maximise an oracle score / fitness function / scoring function / desirability criterion, usually a combination of predicted molecular properties. We generally task our models with finding a molecule \( x^* \) that maximises our score \( s(x) \).

\[ x^* = \arg\max_{x \in X} s(x)\]

Small molecule drug design: Theoretically, the goal of drug design - to find the optimal drug candidate that is efficacious, bioavailable and safe (and non-patented) - is the same, i.e., to find 'the needle in the haystack'. The reality is somewhat different: our oracles are inaccurate, multi-parameter optimisation is a Pareto problem that admits different property profiles, and chemical space is enormous. So I think it's more realistic to say that there are many 'needles in the haystack', and that we want to, for example, identify all molecules with scores above a threshold \( T \).

\[ \exists\, x_1, \ldots, x_N \in X, \; x_i \neq x_j \;\forall\, i \neq j, \text{ such that } s(x_i) \ge T \;\forall\, i \]

So goal-directed drug design should instead aim to identify the set \( X^* \) of all molecules that maximise \( s(x) \). This requires a level of intrinsic exploration.

\[ X^* = \arg\max_{x \in X} s(x) \]

Current benchmarks: Unfortunately, I don't think current benchmarks account for this particular endpoint. Most GuacaMol (Brown et al., 2019) and MolOpt (Gao et al., 2022) tasks focus on either 1) target rediscovery, 2) target similarity, or 3) predicted properties. We can stipulate how the \( s(x) \) landscape might look on a chemical space manifold for these tasks, as depicted below. Target rediscovery has a single optimal point in chemical space; this optimum is slightly broader for similarity; and predicted properties such as JNK3 or DRD2 predicted bioactivity have various regions of high \( s(x) \) (depicted as low-energy minima). This is of course totally hypothetical, but embedding chemical space in an objective way isn't easy.

[Figure: hypothetical \( s(x) \) landscapes for target rediscovery, target similarity, and predicted property tasks]

Post-hoc evaluation metrics that account for the top-k molecules, scaffolds, or even top-k diverse molecules can help (Thomas et al., 2022; Renz et al., 2024), but they do not measure a model's explorative capability well if the task itself does not require much exploration.

Molecule Exploration Benchmark

The Molecule Exploration (MolExp) benchmark requires intrinsic exploration by tasking models with multiple target rediscovery. The oracle is the maximum similarity of a de novo molecule to a set of target molecules, \( s(x) = \max(\text{sim}(x, t_1), \ldots, \text{sim}(x, t_N)) \), and final performance is measured by the product of the maximum similarity achieved to each target molecule. This tests a model's ability to identify all high-reward areas of chemical space, which requires not getting trapped in local minima.
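
Concretely, for the set of all generated molecules \( \mathcal{D} \) and target molecules \( t_1, \ldots, t_N \), this final score can be written as:

\[ \text{Score} = \prod_{j=1}^{N} \max_{x \in \mathcal{D}} \text{sim}(x, t_j) \]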

This benchmark takes inspiration from the converse of the similarity principle, whereby structurally dissimilar molecules nonetheless possess similar bioactivity (a property also leveraged to benchmark predictive models using activity cliffs (Van Tilborg et al., 2022)). It consists of four tasks, each with 2-4 contextually diverse target molecules that share similar bioactivity: anti-psychotic drugs (AP), adenosine A\(_{2A}\) receptor (A2A) ligands, beta-secretase 1 (BACE1) inhibitors, and epidermal growth factor receptor (EGFR) inhibitors.

[Figure: the target molecules for each of the four MolExp tasks]

How do current generative models perform?

Using chemical language models (CLMs) pre-trained on ChEMBL 34 (a training dataset that includes all of the target molecules), here's how different reinforcement learning (RL) algorithms perform when trained with a budget of 10,000 molecules. In this case, REINVENT does not outperform virtual screening (which randomly samples 10,000 molecules from the training dataset). The best algorithm is ACEGEN\(_{MolOpt}\), a REINFORCE-style algorithm hyperparameter-optimised for the MolOpt benchmark (Thomas et al., 2025); however, it achieves 1.62 out of 4, which is still quite far from maximum performance.

If we look at the similarity to each target during training (each curve in the plot below), we can see that even though ACEGEN\(_{MolOpt}\) is the best algorithm tested, it still only maximises similarity to a single target molecule during training; it simply does so better than the other algorithms. This highlights that, within 10,000 molecules, none of these algorithms explore enough to find all targets. We have a difficult benchmark here.

Long story short, this benchmark can be solved, but it requires 128 independent ACEGEN\(_{MolOpt}\) agents. For more information on the benchmark, and on using a population of agents to solve it, see our preprint (including a log-linear scaling law, as seen in inference-time scaling, as well as cooperative approaches 🤓).

I hope this benchmark can grow in the number and difficulty of tasks proposed by the community, to guide generative algorithm development - eventually realising a future where we can quickly identify all drug-like molecules within enormous chemical spaces, as we already do with much smaller explicit chemical spaces via virtual screening.

Running MolExp

The MolExp benchmark is available in MolScore (Thomas et al., 2024), which can be installed via PyPI.

$ pip install "MolScore>=1.9"

Two versions of the benchmark are available:

  • MolExp: Computes the similarity to the target molecules using the Tanimoto similarity of count-based ECFP4 fingerprints.
  • MolExpL: Computes the similarity to the target molecules using a Levenshtein (string-edit) similarity measure between SMILES (both canonicalised). Note this is the version used in the results shown above and in the preprint, because CLMs operate in string space. A sketch of both measures is shown below.
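
For intuition, here's a minimal sketch of what these two measures might look like. This is an illustration only; the exact MolScore implementation, including how the edit distance is normalised, may differ:

from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def ecfp4_similarity(smiles_a, smiles_b):
    # Tanimoto similarity between count-based ECFP4 (Morgan, radius 2) fingerprints
    fp_a = AllChem.GetMorganFingerprint(Chem.MolFromSmiles(smiles_a), 2)
    fp_b = AllChem.GetMorganFingerprint(Chem.MolFromSmiles(smiles_b), 2)
    return DataStructs.TanimotoSimilarity(fp_a, fp_b)

def levenshtein_similarity(smiles_a, smiles_b):
    # Canonicalise both SMILES before comparing strings
    a = Chem.MolToSmiles(Chem.MolFromSmiles(smiles_a))
    b = Chem.MolToSmiles(Chem.MolFromSmiles(smiles_b))
    # Standard dynamic-programming Levenshtein edit distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    # Scale to [0, 1] so that identical strings score 1 (assumed normalisation)
    return 1.0 - prev[-1] / max(len(a), len(b), 1)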
To run the benchmark on YOUR generative model (let's call it 'GM'), you can use MolScore as the scoring function as shown below:
from molscore import MolScoreBenchmark
msb = MolScoreBenchmark(
  model_name='GM',
  output_dir='./results',
  benchmark='MolExpL', # Or 'MolExp'
  budget=10_000
)

for task in msb:
  while not task.finished:
    SMILES = GM.sample() # Sample molecules from your generative model
    scores = task.score(SMILES) # Pass them to MolScore as SMILES to score
    # Use the scores to compute the loss and update your generative model

That's it! Benchmark results will be saved to 'results.csv' in the output directory created. Note that five replicates should be run with different seeds, but that's on you.
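
For example, a replicate loop might look like the following sketch, where build_generative_model is a hypothetical constructor for your model and how you seed it is up to you:

for seed in range(5):
    GM = build_generative_model(seed=seed)  # hypothetical, user-defined
    msb = MolScoreBenchmark(
        model_name=f'GM_seed{seed}',
        output_dir='./results',
        benchmark='MolExpL',
        budget=10_000
    )
    for task in msb:
        while not task.finished:
            scores = task.score(GM.sample())
            # Use the scores to update your generative model, as above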

Links

  • Benchmark preprint: "Test-Time Training Scaling for Chemical Exploration in Drug Design"
  • ACEGEN\(_{MolOpt}\) preprint: "REINFORCE-ING Chemical Language Models in Drug Design"
  • MolScore: Code to run the MolExp benchmark on a generative model of your choice.
  • ACEGEN: Code used to run a CLM with different RL configurations on the MolExp benchmark.
  • Cheminformantics code: Results and the Jupyter notebook generating the figures shown in this blog.

References

  • Brown, Nathan, et al. "GuacaMol: benchmarking models for de novo molecular design." Journal of Chemical Information and Modeling 59.3 (2019): 1096-1108.
  • Du, Yuanqi, et al. "Machine learning-aided generative molecular design." Nature Machine Intelligence (2024): 1-16.
  • Gao, Wenhao, et al. "Sample efficiency matters: a benchmark for practical molecular optimization." Advances in Neural Information Processing Systems 35 (2022): 21342-21357.
  • Renz, Philipp, Sohvi Luukkonen, and Günter Klambauer. "Diverse Hits in De Novo Molecule Design: Diversity-Based Comparison of Goal-Directed Generators." Journal of Chemical Information and Modeling 64.15 (2024): 5756-5761.
  • Thomas, Morgan, Albert Bou, and Gianni De Fabritiis. "REINFORCE-ING Chemical Language Models in Drug Design." arXiv preprint arXiv:2501.15971 (2025).
  • Thomas, Morgan, et al. "MolScore: a scoring, evaluation and benchmarking framework for generative models in de novo drug design." Journal of Cheminformatics 16.1 (2024): 64.
  • Thomas, Morgan, et al. "Re-evaluating sample efficiency in de novo molecule generation." arXiv preprint arXiv:2212.01385 (2022).
  • Van Tilborg, Derek, Alisa Alenicheva, and Francesca Grisoni. "Exposing the limitations of molecular machine learning with activity cliffs." Journal of Chemical Information and Modeling 62.23 (2022): 5938-5951.


