Blinded Predictions and Post Hoc Analysis of the Second Solubility Challenge Data: Exploring Training Data and Feature Set Selection for Machine and Deep Learning Models

Cited by: 8
Authors
Conn, Jonathan G. M. [1]
Carter, James W. [1]
Conn, Justin J. A. [1]
Subramanian, Vigneshwari [2,3]
Baxter, Andrew [4]
Engkvist, Ola [5,6]
Llinas, Antonio [2]
Ratkova, Ekaterina L. [5]
Pickett, Stephen D. [7]
McDonagh, James L. [8]
Palmer, David S. [1]
Affiliations
[1] Univ Strathclyde, Dept Pure & Appl Chem, Glasgow G1 1XL, Scotland
[2] AstraZeneca, BioPharmaceut R&D, Drug Metab & Pharmacokinet, Res & Early Dev, Resp & Immunol, SE-43183 Gothenburg, Sweden
[3] AstraZeneca, R&D, Imaging & Data Analyt, Clin Pharmacol & Safety Sci, Pepparedsleden 1, SE-43183 Gothenburg, Sweden
[4] GSK Med Res Ctr, Stevenage SG1 2NY, England
[5] AstraZeneca, BioPharmaceut R&D, Res & Early Dev, Cardiovasc Renal & Metab CVRM, Med Chem, SE-43150 Gothenburg, Sweden
[6] Chalmers Univ Technol, Dept Comp Sci & Engn, SE-41296 Gothenburg, Sweden
[7] GlaxoSmithKline R&D Pharmaceut, Computat Sci, Stevenage SG1 2NY, England
[8] SciTech Daresbury, Hartree Ctr, IBM Res Europe, Warrington WA4 4AD, Cheshire, England
Funding
UK Engineering and Physical Sciences Research Council (EPSRC);
Keywords
INTRINSIC AQUEOUS SOLUBILITY; MOLECULES;
DOI
10.1021/acs.jcim.2c01189
Chinese Library Classification (CLC) number
R914 [Medicinal Chemistry];
Discipline classification code
100701;
Abstract
Accurate methods to predict solubility from molecular structure are highly sought after in the chemical sciences. To assess the state of the art, the American Chemical Society organized a "Second Solubility Challenge" in 2019, in which competitors were invited to submit blinded predictions of the solubilities of 132 drug-like molecules. In the first part of this article, we describe the development of two models that were submitted to the Blind Challenge in 2019 but which have not previously been reported. These models were based on computationally inexpensive molecular descriptors and traditional machine learning algorithms and were trained on a relatively small data set of 300 molecules. In the second part of the article, to test the hypothesis that predictions would improve with more advanced algorithms and higher volumes of training data, we compare these original predictions with those made after the deadline using deep learning models trained on larger solubility data sets consisting of 2999 and 5697 molecules. The results show that there are several algorithms that are able to obtain near state-of-the-art performance on the solubility challenge data sets, with the best model, a graph convolutional neural network, resulting in an RMSE of 0.86 log units. Critical analysis of the models reveals systematic differences between the performance of models using certain feature sets and training data sets. The results suggest that careful selection of high quality training data from relevant regions of chemical space is critical for prediction accuracy but that other methodological issues remain problematic for machine learning solubility models, such as the difficulty in modeling complex chemical spaces from sparse training data sets.
Pages: 1099 - 1113
Page count: 15
Related papers
50 in total
  • [21] Identification of Polymers with a Small Data Set of Mid-infrared Spectra: A Comparison between Machine Learning and Deep Learning Models
    Tian, Xin
    Been, Frederic
    Sun, Yiqun
    van Thienen, Peter
    Bauerlein, Patrick S.
    [J]. ENVIRONMENTAL SCIENCE & TECHNOLOGY LETTERS, 2023, 10 (11) : 1030 - 1035
  • [22] Energy Performance Analysis of Photovoltaic Integrated with Microgrid Data Analysis Using Deep Learning Feature Selection and Classification Techniques
    Qaiyum, Sana
    Margala, Martin
    Kshirsagar, Pravin R. R.
    Chakrabarti, Prasun
    Irshad, Kashif
    [J]. SUSTAINABILITY, 2023, 15 (14)
  • [23] VULMA: Data Set for Training a Machine-Learning Tool for a Fast Vulnerability Analysis of Existing Buildings
    Cardellicchio, Angelo
    Ruggieri, Sergio
    Leggieri, Valeria
    Uva, Giuseppina
    [J]. DATA, 2022, 7 (01)
  • [24] DDBJ Data Analysis Challenge: a machine learning competition to predict Arabidopsis chromatin feature annotations from DNA sequences
    Kaminuma, Eli
    Baba, Yukino
    Mochizuki, Masahiro
    Matsumoto, Hirotaka
    Ozaki, Haruka
    Okayama, Toshitsugu
    Kato, Takuya
    Oki, Shinya
    Fujisawa, Takatomo
    Nakamura, Yasukazu
    Arita, Masanori
    Ogasawara, Osamu
    Kashima, Hisashi
    Takagi, Toshihisa
    [J]. GENES & GENETIC SYSTEMS, 2020, 95 (01) : 43 - 50
  • [25] Evaluation of deep learning-based feature selection for single-cell RNA sequencing data analysis
    Hao Huang
    Chunlei Liu
    Manoj M. Wagle
    Pengyi Yang
    [J]. Genome Biology, 24
  • [26] Evaluation of deep learning-based feature selection for single-cell RNA sequencing data analysis
    Huang, Hao
    Liu, Chunlei
    Wagle, Manoj M.
    Yang, Pengyi
    [J]. GENOME BIOLOGY, 2023, 24 (01)
  • [27] Role of Simulated Lidar Data for Training 3D Deep Learning Models: An Exhaustive Analysis
    Lohani, Bharat
    Khan, Parvej
    Kumar, Vaibhav
    Gupta, Siddhartha
    [J]. JOURNAL OF THE INDIAN SOCIETY OF REMOTE SENSING, 2024, 52 (09) : 2003 - 2019
  • [28] Post-hoc modification of linear models: Combining machine learning with domain information to make solid inferences from noisy data
    van Vliet, Marijn
    Salmelin, Riitta
    [J]. NEUROIMAGE, 2020, 204
  • [29] Feature averaging of historical meteorological data with machine and deep learning assist wind farm power performance analysis and forecasts
    David A. Wood
    [J]. Energy Systems, 2023, 14 : 1023 - 1049
  • [30] Feature selection with a deep learning based high-performance computing model for traffic flow analysis of Twitter data
    Mounica, B.
    Lavanya, K.
    [J]. JOURNAL OF SUPERCOMPUTING, 2022, 78 (13) : 15107 - 15122