Blinded Predictions and Post Hoc Analysis of the Second Solubility Challenge Data: Exploring Training Data and Feature Set Selection for Machine and Deep Learning Models

Cited by: 8

Authors:

Conn, Jonathan G. M. [1]

Carter, James W. [1]

Conn, Justin J. A. [1]

Subramanian, Vigneshwari [2,3]

Baxter, Andrew [4]

Engkvist, Ola [5,6]

Llinas, Antonio [2]

Ratkova, Ekaterina L. [5]

Pickett, Stephen D. [7]

McDonagh, James L. [8]

Palmer, David S. [1]

Affiliations:

[1] Univ Strathclyde, Dept Pure & Appl Chem, Glasgow G1 1XL, Scotland

[2] AstraZeneca, BioPharmaceut R&D, Drug Metab & Pharmacokinet, Res & Early Dev, Resp & Immunol, SE-43183 Gothenburg, Sweden

[3] AstraZeneca, R&D, Imaging & Data Analyt, Clin Pharmacol & Safety Sci, Pepparedsleden 1, SE-43183 Gothenburg, Sweden

[4] GSK Med Res Ctr, Stevenage SG1 2NY, England

[5] AstraZeneca, BioPharmaceut R&D, Res & Early Dev, Cardiovasc Renal & Metab CVRM, Med Chem, SE-43150 Gothenburg, Sweden

[6] Chalmers Univ Technol, Dept Comp Sci & Engn, SE-41296 Gothenburg, Sweden

[7] GlaxoSmithKline R&D Pharmaceut, Computat Sci, Stevenage SG1 2NY, England

[8] SciTech Daresbury, Hartree Ctr, IBM Res Europe, Warrington WA4 4AD, Cheshire, England

Source:

JOURNAL OF CHEMICAL INFORMATION AND MODELING | 2023 / Volume 63 / Issue 04

Funding:

Engineering and Physical Sciences Research Council (EPSRC), UK;

Keywords:

INTRINSIC AQUEOUS SOLUBILITY; MOLECULES;

DOI:

10.1021/acs.jcim.2c01189

Chinese Library Classification (CLC) number:

R914 [Medicinal Chemistry];

Discipline classification code:

100701 ;

Abstract:

Accurate methods to predict solubility from molecular structure are highly sought after in the chemical sciences. To assess the state of the art, the American Chemical Society organized a "Second Solubility Challenge" in 2019, in which competitors were invited to submit blinded predictions of the solubilities of 132 drug-like molecules. In the first part of this article, we describe the development of two models that were submitted to the Blind Challenge in 2019 but which have not previously been reported. These models were based on computationally inexpensive molecular descriptors and traditional machine learning algorithms and were trained on a relatively small data set of 300 molecules. In the second part of the article, to test the hypothesis that predictions would improve with more advanced algorithms and higher volumes of training data, we compare these original predictions with those made after the deadline using deep learning models trained on larger solubility data sets consisting of 2999 and 5697 molecules. The results show that there are several algorithms that are able to obtain near state-of-the-art performance on the solubility challenge data sets, with the best model, a graph convolutional neural network, resulting in an RMSE of 0.86 log units. Critical analysis of the models reveals systematic differences between the performance of models using certain feature sets and training data sets. The results suggest that careful selection of high quality training data from relevant regions of chemical space is critical for prediction accuracy but that other methodological issues remain problematic for machine learning solubility models, such as the difficulty in modeling complex chemical spaces from sparse training data sets.


Pages: 1099-1113

Number of pages: 15
