A Natural Language Processing (NLP)-based deep learning approach to predict solubility parameters for drug discovery
Jiayun Pang
University of Greenwich, UK
Deep learning techniques are now routinely used to understand huge volumes of raw text and have revolutionised the field of Natural Language Processing (NLP). These techniques are increasingly applied to other domains where the data resemble text. One example is SMILES (Simplified Molecular Input Line Entry System) in chemistry and biology, a line notation that describes molecular structure using a string of chemical element symbols and other characters. Through SMILES, it is possible to adopt several powerful NLP algorithms to process molecular structures for property prediction and novel structure generation. SMILES-based deep learning models are emerging as an important research tool in the ongoing data-driven revolution of chemical and biological research.
Solubility is an important factor to consider in drug discovery and pharmaceutical formulation. Predicting the solubility of molecules can be challenging because it depends on various physicochemical factors. This is reflected in the widely used Hansen solubility parameters (HSPs), which have three components, δd, δp and δh, accounting for the dispersion forces, polar forces and hydrogen bonding of a molecule, respectively. These parameters were designed to better understand how molecular structure affects solubility and have many applications in the pharmaceutical and chemical industries. While experimental methods can be used to determine HSPs, this is not feasible for the vast number of hypothetical molecules routinely used for virtual screening in drug discovery. Several theoretical approaches have been developed to determine HSPs, but they usually involve computing chemical descriptors, which requires expert knowledge of the molecules and can be time-consuming.
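To make the analogy between SMILES and text concrete, the following is a minimal sketch of how a SMILES string might be split into tokens and mapped to integer ids, the usual first step before an embedding model is applied. The regular expression, the function names `tokenize_smiles`, `build_vocab` and `encode`, and the example molecules are illustrative assumptions and are not taken from the study; the pattern covers only common SMILES syntax.

```python
import re

# Illustrative token pattern (an assumption, not the study's tokenizer).
SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]"            # bracket atoms such as [NH4+] or [C@@H]
    r"|Br|Cl"                # two-letter elements of the organic subset
    r"|[BCNOPSFI]"           # one-letter elements of the organic subset
    r"|[bcnops]"             # aromatic atoms
    r"|%\d{2}"               # two-digit ring-closure labels
    r"|\d"                   # single-digit ring-closure labels
    r"|[=#$:/\\.\-+()~*@]"   # bonds, branches and other symbols
)


def tokenize_smiles(smiles):
    """Split a SMILES string into tokens; fail if any character is unmatched."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"unrecognised characters in SMILES: {smiles!r}")
    return tokens


def build_vocab(smiles_list):
    """Assign an integer id to every distinct token in a list of SMILES."""
    tokens = {tok for smi in smiles_list for tok in tokenize_smiles(smi)}
    return {tok: idx for idx, tok in enumerate(sorted(tokens))}


def encode(smiles, vocab):
    """Convert a SMILES string into a list of token ids."""
    return [vocab[tok] for tok in tokenize_smiles(smiles)]


if __name__ == "__main__":
    corpus = ["CCO", "ClCCBr", "CC(=O)Oc1ccccc1C(=O)O"]  # hypothetical examples
    vocab = build_vocab(corpus)
    for smi in corpus:
        print(smi, tokenize_smiles(smi), encode(smi, vocab))
```

Sequences of token ids like these are what a Word2Vec-style or BERT-style model would consume; the models in the study may use a different tokenisation scheme.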
There is a need to explore new ways to predict HSPs and, more broadly, to understand solubility from a molecular structure-based and data-driven perspective that would be more rigorous and efficient. In the current study, we aim to develop a new way to predict HSPs from only the SMILES of molecules, utilising transfer learning and the so-called embedding approach widely applied in NLP [1]. Two deep learning models, a Word2Vec model [2] and a BERT-based fine-tuning model [3], were trained on tens of millions of molecules from databases such as ZINC and ChEMBL. A dataset of ~1200 organic molecules with experimentally determined HSPs provided the labels. The data was randomly split into 80/20/20 for training, validation and testing. The accuracy of the models was assessed using the RMSE, MAE and R² between the predicted and experimental HSPs. Overall, the deep learning models demonstrate higher predictive power than the baseline Morgan fingerprint model and give predictions on a par with the more computationally intensive chemical descriptor-based models. This new data-driven approach could improve the computational efficiency of predicting HSPs for virtual screening of drug candidates and the design of pharmaceutical formulations.
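The three accuracy measures named above can be sketched as follows. This is a minimal illustration in plain Python, assuming R² denotes the coefficient of determination (the abstract does not state which definition was used); the function name `regression_metrics` and the numerical values are hypothetical and are not results from the study.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return RMSE, MAE and R^2 for paired experimental and predicted values."""
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")

    errors = [p - t for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)           # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all experimental values are equal")

    return {
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(e) for e in errors) / n,
        "r2": 1.0 - ss_res / ss_tot,  # coefficient of determination (assumed)
    }


if __name__ == "__main__":
    # Hypothetical values for one HSP component, for illustration only.
    experimental = [16.0, 18.0, 20.0]
    predicted = [16.5, 17.5, 20.0]
    print(regression_metrics(experimental, predicted))
```

In practice the metrics would be computed separately for each of the three HSP components on the held-out test set.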
Figure: (a) Visualisation of a 3-D Cartesian space showing the three components of HSPs. (b) An example scatter plot of predicted HSPs against experimental HSPs (R² = 0.8).
References
1. Cova, Frontiers in Chemistry, 2019. DOI: 10.3389/fchem.2019.00809
2. Jaeger, J. Chem. Inf. Model., 2018, 58, 27-35. DOI: 10.1021/acs.jcim.7b00616
3. Chithrananda, arXiv preprint, 2020. arXiv:2010.09885
DC05
© The Author(s), 2023