Structure-aware generative molecular design: from 2D or 3D?

TL;DR: I compared recent structure-explicit (3D) algorithms vs. structure-implicit (typically 2D) algorithms for structure-aware generative molecular design. I found that, from a practical perspective, structure-implicit (2D) methods are currently more useful, but the choice of scoring function is important. See the details and future outlook below, or the Jupyter notebook for the data and analysis.

Background

Knowledge of protein structure and of how a ligand binds provides critical insight: it enables more explicit hypotheses about important intermolecular interactions and potential new ligand growth vectors (or ones to avoid), and, in the absence of any ligand, it provides a model to test virtual molecules in silico. Therefore, structure-aware 3D generative molecular design (GMD) should capitalise in the same way, right?

For this reason, 3D-GMD (algorithms that build the ligand in 3D) and structure-explicit generative molecular design (algorithms that utilise a representation of the protein structure) are currently a major focus of research. However, you can also achieve structure-guided GMD by using structure-aware scoring functions to steer generation that is typically done in 2D. Last year I wrote a mini-review comparing these approaches (Thomas M, et al., 2023), but it is already a little out of date and doesn't include, for example, diffusion models for 3D-GMD.

A personal frustration of mine is that these two 'camps' aren't typically compared to each other in research papers, nor is the quality of chemistry generated easy to assess. I recently put together a mini comparison for a conference, to get a feel for the differences. So here it is in blog form...

What is the current state of 3D-GMD?

Luckily for me, a flurry of benchmark and comparison papers on 3D-GMD have been published in the last year. So let's simply take a look at some of their conclusions.

PoseCheck (Harris C et al. arXiv, Aug 2023): “The exceedingly high strain energy [~1,200 kcal/mol] values observed in this scenario should be approached with considerable prudence. For comparison, the combustion of TNT releases approximately 815 kcal/mol.”

Does 3D dominate? (Zheng K et al. arXiv, Jun 2024): “Also, representing target molecules in 3D format does not significantly improve both the molecular quality and binding affinity.”

DrugPose (Jocys Z et al. Digital Discovery, Jun 2024): “The evaluation of different [3D] models reveals that the current generative models have limitations in generating molecules with the desired binding mode.”

GenBench3D (Baillif B et al. ChemRxiv, Jul 2024): “We benchmarked six structure-based [3D] molecular generative models and showed that they generated mostly molecules with invalid geometries.”

CBGBench (Lin H et al. arXiv, Jul 2024): “it is worth noting that DIFFBP, FLAG, and POCKET2MOL can possibly generate molecules with small atom numbers and high LBE [ligand binding efficiency].”

POKMOL3D (Liu H et al. ChemRxiv, Aug 2024): “Overall, the performance of 3D generative models on large scale of pockets is still far from satisfactory, and polishing of network architecture is needed to improve the learning of ligand-protein interaction and generalizability of current 3D generative models to enable their application on wide range of protein pockets.”

Well, I wouldn't say these are all glowing recommendations; the most positive conclusion seems to be that some models "can possibly generate molecules ... with high LBE". However, only the comparisons by Zheng et al. and Liu H et al. included structure-implicit GMD in 2D, and only the comparison by Liu H et al. also evaluated the type of chemistry generated beyond geometry and basic measures (like QED, which in my experience can be hacked). I therefore point the interested reader to the comparison by Liu H et al. Nonetheless, I wanted to try some methods myself.

Experiment

In the interest of breadth, I selected six orthogonal GMD algorithms: three structure-explicit models based on the best-performing algorithms in the above benchmarks, and three structure-implicit GMD algorithms that should make good baselines.

Structure-explicit GMD algorithms
  • [3D] Lingo3DMol (Feng W, et al.): An auto-regressive language model that generates local coordinates of atoms conditioned on the protein pocket.
  • [3D] Pocket2Mol (Peng X, et al.): An auto-regressive graph neural network that predicts the focal atom, the new atom position, and the atom and bond types directly in the pocket.
  • [3D] TargetDiff (Guan J, et al.): A diffusion model that de-noises atom positions conditioned on the protein pocket.
Structure-implicit GMD algorithms
  • [2D] AHC (Thomas M et al., 2022): An auto-regressive GRU-based chemical language model pretrained on ChEMBL, with efficient reinforcement learning to optimise an objective.
  • [2D] AutoGrow 4 (Spiegel JO, et al.): Heuristic molecule building optimised by a genetic algorithm over chemistry-aware operations including crossover by substructure overlap, and mutations by chemical reactions.
  • [3D] 3D-MCTS (Du H, et al.): Heuristic molecule building in the pocket by step-wise addition of fragments optimised by a Monte-Carlo Tree Search. 
To compare methods, I generated possible Dopamine D2 Receptor (DRD2) ligands using the crystal structure of DRD2 bound to Risperidone (PDB: 6CM4).
- For structure-explicit algorithms, I sampled 1,000 molecules conditioned on the pocket.
- For structure-implicit algorithms, I trained them to minimise the Vina docking score with a budget of 10,000 molecules and then kept the best 1,000 by docking score (roughly as sketched below).
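As a rough sketch of that scoring step (not my exact pipeline), the snippet below scores prepared ligands with the AutoDock Vina Python bindings and keeps the best 1,000. The receptor file, docking box, and ligand paths are placeholders, and ligand preparation (e.g., converting SMILES to PDBQT with Meeko or Open Babel) is omitted.

```python
from vina import Vina

def vina_score(ligand_pdbqt: str,
               receptor_pdbqt: str = "6CM4_receptor.pdbqt",  # hypothetical prepared receptor
               center=(9.9, 5.8, -9.7),                      # placeholder box centre on the risperidone site
               box_size=(22.0, 22.0, 22.0)) -> float:
    """Dock one prepared ligand and return the best Vina score (kcal/mol, lower is better)."""
    v = Vina(sf_name="vina")
    v.set_receptor(receptor_pdbqt)
    v.set_ligand_from_file(ligand_pdbqt)
    v.compute_vina_maps(center=list(center), box_size=list(box_size))
    v.dock(exhaustiveness=8, n_poses=1)
    return float(v.energies(n_poses=1)[0][0])  # total energy of the top-ranked pose

# Score the 10,000-molecule budget and keep the best 1,000 by docking score
ligand_files = [f"generated/mol_{i}.pdbqt" for i in range(10_000)]  # hypothetical paths
best_1000 = sorted(ligand_files, key=vina_score)[:1000]
```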

Qualitative analysis (A picture says a thousand words)

Well, you know the saying... let's look at three randomly selected molecules from each approach.

[Figure: three randomly selected molecules from each of Lingo3DMol, Pocket2Mol, TargetDiff, 3D-MCTS, AutoGrow4, and AHC]
Ouch. I think the comp/med/synth chemists out there will agree that, without even looking at the protein interactions, none of the molecules look particularly attractive. The structure-explicit 3D-GMD methods can generate complex (or even likely impossible) caged structures. The most promising molecules are probably those from 3D-MCTS, whose ribose-like fragments aren't a million miles away from the endogenous ligand dopamine with its dihydroxybenzene moiety; however, the algorithm converges onto only these fragments. Meanwhile, AutoGrow4 and AHC tend to generate lipophilic, steroidal-looking ring structures, which aren't appealing either. Hmm...

Quantitative analysis I (Geometry comparison)

To investigate geometry, I tested the 1,000 molecules from each algorithm with PoseCheck (Harris C, et al., 2023), which calculates the following (see the usage sketch after this list):
  1. The number of steric clashes, i.e., instances where two neutral atoms are closer than their combined van der Waals radii, which would be energetically unfavourable.
  2. The ligand strain energy, i.e., the energy difference between the generated pose and a conformationally relaxed ligand without the protein, calculated using the Universal Force Field (UFF).
  3. The interaction similarity, i.e., the Tanimoto similarity of the protein-ligand interactions compared to those of the reference ligand, in this case Risperidone.
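For reference, here is a minimal sketch of how these three metrics can be computed with PoseCheck, following the usage pattern in its repository; the file paths are placeholders and the method names should be double-checked against the library's documentation.

```python
from posecheck import PoseCheck

pc = PoseCheck()
pc.load_protein_from_pdb("6CM4_protein.pdb")      # placeholder: prepared DRD2 structure
pc.load_ligands_from_sdf("generated_poses.sdf")   # placeholder: generated (or Vina-embedded) poses

clashes = pc.calculate_clashes()            # steric clashes per ligand
strain = pc.calculate_strain_energy()       # UFF strain energy per ligand (kcal/mol)
interactions = pc.calculate_interactions()  # protein-ligand interaction fingerprints
```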
As expected, the models that use Vina to embed and place molecules in the pocket result in fewer steric clashes. This shows there is still some room for improvement in structure-explicit 3D-GMD to bridge the gap to physics-based force fields.
Likewise, the Vina-embedded molecules have much lower ligand strain energy, and there is still much-needed improvement in the ligand strain of structure-explicit 3D-GMD, especially considering that the energy is on a log scale. I should note that although Pocket2Mol showed the best ligand strain profile, around 25% of its molecules failed minimisation, so the plot may overestimate its performance.
We can also see that TargetDiff and AHC actually generate molecules with the interaction fingerprint profiles most similar to the reference ligand. Not bad, TargetDiff. On average, I think the structure-explicit methods do slightly better here than their implicit counterparts.

Quantitative analysis II (Chemical quality comparison)

Now for the part often missing in machine-learning conference papers. To quickly assess the quality of the chemistry generated, I calculated a few metrics using MolScore (Thomas M, et al., 2024), listed below (a rough sketch of some of them follows the list):
  1. Proportion of unique scaffolds.
  2. Proportion of molecules that pass chemistry filters. This includes medicinal chemistry filters and PAINS filters that discard reactive, problematic, or promiscuous substructures as well as ensuring molecules are in a sensible property space (e.g., molecular weight, LogP range, and number of rotatable bonds).
  3. Average number of outlier ECFP fingerprint bits that aren't present in a reference dataset (here ChEMBL). This is a proxy for idiosyncratic atomic environments.
  4. Average single-nearest neighbour similarity to known D2 ligands.
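MolScore wraps all of these up, but to make a few of them concrete, here is a rough RDKit-only sketch of the unique-scaffold proportion, the outlier-bit count, and the single-nearest-neighbour similarity. The SMILES lists are tiny placeholders (in practice, the 1,000 generated molecules and known D2 ligands, e.g., from ChEMBL), and this is an illustration rather than MolScore's implementation.

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem
from rdkit.Chem.Scaffolds import MurckoScaffold

# Placeholder inputs: swap in the 1,000 generated SMILES and the known D2 ligand set
generated_smiles = ["O=C1CCCCC1Cc1ccccc1", "NCCc1ccc(O)c(O)c1"]
reference_smiles = ["CC1=C(C(=O)N2CCCCC2=N1)CCN3CCC(CC3)C4=NOC5=CC(=CC=C45)F"]  # e.g., risperidone

gen_mols = [m for m in (Chem.MolFromSmiles(s) for s in generated_smiles) if m is not None]
ref_mols = [m for m in (Chem.MolFromSmiles(s) for s in reference_smiles) if m is not None]

gen_fps = [AllChem.GetMorganFingerprintAsBitVect(m, 2, nBits=2048) for m in gen_mols]
ref_fps = [AllChem.GetMorganFingerprintAsBitVect(m, 2, nBits=2048) for m in ref_mols]

# 1. Proportion of unique Bemis-Murcko scaffolds among generated molecules
scaffolds = {Chem.MolToSmiles(MurckoScaffold.GetScaffoldForMol(m)) for m in gen_mols}
unique_scaffold_prop = len(scaffolds) / len(gen_mols)

# 3. Average number of ECFP4 bits per molecule never seen in the reference set
ref_bits = set().union(*(fp.GetOnBits() for fp in ref_fps))
avg_outlier_bits = sum(len(set(fp.GetOnBits()) - ref_bits) for fp in gen_fps) / len(gen_fps)

# 4. Average single-nearest-neighbour Tanimoto similarity to known D2 ligands
avg_snn = sum(max(DataStructs.BulkTanimotoSimilarity(fp, ref_fps)) for fp in gen_fps) / len(gen_fps)
```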

Most striking is that only 20-30% of molecules from the structure-explicit 3D-GMD methods pass the molecular quality filters, and most of them don't generate a particularly diverse range of scaffolds either (aside from TargetDiff). However, there isn't much overall difference in similarity to known D2 ligands, which suggests none of the algorithms are generating molecules in the 'retrospectively correct' chemical space.

Disappointingly, AHC also performs poorly at generating molecules that pass the molecular quality filters. However, if we look at the docking scores that were optimised, we can see that AHC also learns to optimise the Vina score much more than the alternative structure-implicit algorithms. I believe this is a combination of AHC's broad generative domain and learning efficiency, allowing it to access more Vina-favourable chemical space than the other, more chemically restrictive, generative algorithms (which build molecules via hard-coded rules). More Vina-favourable chemical space means larger, greasier molecules that achieve good scores. For these reasons, I've never tested a system where Vina proved a useful scoring function to guide GMD - perhaps a topic for another post.

For the sake of curiosity, I repeated the experiment using rDock instead, which uses a broader balance of physics-based interaction terms in its scoring function. The new analysis below shows an increase from around 10% of molecules passing the molecular quality filters to around 90%. Moreover, this approach now generates the molecules most similar to known D2 ligands on average.


Concluding remarks

GMD from 2D or 3D? Based on the 3D structure-explicit methods I see published, and those I briefly tested here, it is my opinion that 2D structure-implicit methods are still much more practical. With these 3D methods, you lose ~80% of molecules to the molecular quality filters, and probably most of the rest to protein clashes and high ligand strain. To play my own devil's advocate, most benchmarks and approaches apply a post-hoc minimisation of 3D-generated ligands, which will fix many of these issues; furthermore, you can simply consider only the 'good' molecules that pass the filters. For me, this boils down to model trust: do I trust a model that hasn't properly learned the prerequisites of good ligand geometry and regularisation to sensible chemical space? No. However, I think good progress is being made in this data-limited domain.
 
Where are 3D methods going? To get a sense of the direction of travel, let's take a look at two recent papers.

1) TacoGFN (Shen T, et al.): This GFlowNet for structure-based drug design was recently published and compared within the camp of structure-explicit 3D methods. However, the generative model actually utilises a 2D representation of ligands: "Since a molecule’s desirability as a drug in the real world is independent of its predicted 3D conformation, the 2D representation of a ligand here is appropriate." I think it's interesting to see methods going back to 2D ligand representations to achieve SOTA. More generally, I think we will see (and require) more utilisation of broader ligand chemical spaces that haven't been co-crystallised with a protein. I don't mean augmented datasets like CrossDocked either, which may introduce more noise than signal; I mean fine-tuning or regularisation on non-protein-conditioned ligand molecules in 3D. Otherwise, the data we have publicly available is just too limited. P.S. It would probably be more appropriate for the authors to compare this method to a pocket-conditioned RNN, for example the one by Xu M et al., or any of the other protein-conditioned 2D generative models that came before it.

2) PILOT (Cremer J, et al.): This is a pocket-conditioned diffusion model that generates molecules directly in 3D in a structure-explicit manner. In particular, importance sampling is used to guide the diffusion with several oracles, including the docking score. Hence, the scoring functions used by structure-implicit methods were reused here to improve the results. I should mention that TacoGFN also utilised a docking-score predictor. So it seems these scoring functions are being integrated again. I think this is a necessary step given the sheer sparsity of data available to purely distribution-learning models. If this is the trajectory, then we are effectively heading towards simply skipping the conformational search step of docking. That's pretty useful from a practical, time-saving point of view; it's possibly less exciting than we hoped for, but probably a necessary intermediate step on the way to one-shot, drug-like, synthesisable, high-binding-affinity molecules.

Importance of structure-based scoring functions. In this blog, the difference between using Vina versus rDock was vast with respect to the quality of the chemistry generated. I think the choice of structure-based scoring function is pivotal in the short term, especially since recent trends show continued uptake of classical docking software. Not to mention that the applicability domain of any ML-based scoring function can be problematic. The way I see it, goal-directed generative models can't be blamed for hacking imperfect scoring functions; therefore, I would rather have scoring functions I somewhat understand than black-box functions. Right now, I understand the limitations of docking better than those of a black-box ML model, which enables me to plan and implement strategies to mitigate those limitations. Lastly, it helps if I can trust the generative model to at least generate reasonable chemistry (and geometry) so that I don't have to account for this within my scoring function.

References

Baillif, B., Cole, J., McCabe, P., & Bender, A. (2024). Benchmarking structure-based three-dimensional molecular generative models using GenBench3D: ligand conformation quality matters. arXiv preprint arXiv:2407.04424.
Cremer, J., Le, T., Noé, F., Clevert, D. A., & Schütt, K. T. (2024). PILOT: Equivariant diffusion for pocket conditioned de novo ligand generation with multi-objective guidance via importance sampling. arXiv preprint arXiv:2405.14925.
Du, H., Jiang, D., Zhang, O., Wu, Z., Gao, J., Zhang, X., ... & Hou, T. (2023). A flexible data-free framework for structure-based de novo drug design with reinforcement learning. Chemical Science, 14(43), 12166-12181.
Feng, W., Wang, L., Lin, Z., Zhu, Y., Wang, H., Dong, J., ... & Zhou, W. (2024). Generation of 3D molecules in pockets via a language model. Nature Machine Intelligence, 6(1), 62-73.
Guan, J., Qian, W. W., Peng, X., Su, Y., Peng, J., & Ma, J. (2023). 3d equivariant diffusion for target-aware molecule generation and affinity prediction. arXiv preprint arXiv:2303.03543.
Harris, C., Didi, K., Jamasb, A., Joshi, C., Mathis, S., Lio, P., & Blundell, T. (2023). Posecheck: Generative models for 3D structure-based drug design produce unrealistic poses. In NeurIPS 2023 Generative AI and Biology (GenBio) Workshop.
Jocys, Z., Grundy, J., & Farrahi, K. (2024). DrugPose: benchmarking 3D generative methods for early-stage drug discovery. Digital Discovery.
Lin, H., Zhao, G., Zhang, O., Huang, Y., Wu, L., Liu, Z., ... & Li, S. Z. (2024). CBGBench: Fill in the Blank of Protein-Molecule Complex Binding Graph. arXiv preprint arXiv:2406.10840.
Liu, H., Niu, Z., Qin, Y., Xu, M., Wu, J., Xiao, X., ... & Chen, H. (2024). How good are current pocket based 3D generative models?: The benchmark set and evaluation on protein pocket based 3D molecular generative models. chemRxiv preprint 10.26434/chemrxiv-2024-2qgpb.
Peng, X., Luo, S., Guan, J., Xie, Q., Peng, J., & Ma, J. (2022, June). Pocket2mol: Efficient molecular sampling based on 3d protein pockets. In International Conference on Machine Learning (pp. 17644-17655). PMLR.
Shen, T., Pandey, M., & Ester, M. (2023). TacoGFN: Target Conditioned GFlowNet for Drug Design. In NeurIPS 2023 Generative AI and Biology (GenBio) Workshop.
Spiegel, J. O., & Durrant, J. D. (2020). AutoGrow4: an open-source genetic algorithm for de novo drug design and lead optimization. Journal of Cheminformatics, 12, 1-16.
Thomas, M., Bender, A., & de Graaf, C. (2023). Integrating structure-based approaches in generative molecular design. Current Opinion in Structural Biology, 79, 102559.
Thomas, M., O’Boyle, N. M., Bender, A., & De Graaf, C. (2022). Augmented Hill-Climb increases reinforcement learning efficiency for language-based de novo molecule generation. Journal of Cheminformatics, 14(1), 68.
Thomas, M., O’Boyle, N. M., Bender, A., & De Graaf, C. (2024). MolScore: a scoring, evaluation and benchmarking framework for generative models in de novo drug design. Journal of Cheminformatics, 16(1), 64.
Xu, M., Ran, T., & Chen, H. (2021). De novo molecule design through the molecular generative model conditioned by 3D information of protein binding sites. Journal of Chemical Information and Modeling, 61(7), 3240-3254.
Zheng, K., Lu, Y., Zhang, Z., Wan, Z., Ma, Y., Zitnik, M., & Fu, T. (2024). Structure-based Drug Design Benchmark: Do 3D Methods Really Dominate?. arXiv preprint arXiv:2406.03403.
