A machine learning approach for the prediction of aqueous solubility of pharmaceuticals: a comparative model and dataset analysis

Authors
Ghanavati, Mohammad Amin [1]
Ahmadi, Soroush [1,2]
Rohani, Sohrab [1]
Affiliations
[1] Western Univ, Chem & Biochem Engn, London, ON N6A 5B9, Canada
[2] MIT, Dept Chem Engn, Cambridge, MA 02139 USA
Funding
Natural Sciences and Engineering Research Council of Canada
Keywords
IN-SILICO PREDICTION; DRUG SOLUBILITY; WATER; SOLVATION; MOLECULES; SOLVENTS;
DOI
10.1039/d4dd00065j
Chinese Library Classification
O6 [Chemistry]
Subject classification code
0703
Abstract
The effectiveness of drug treatments depends significantly on the water solubility of compounds, influencing bioavailability and therapeutic outcomes. A reliable predictive solubility tool enables drug developers to swiftly identify drugs with low solubility and implement proactive solubility enhancement techniques. The current research proposes three predictive models based on four solubility datasets (ESOL, AQUA, PHYS, OCHEM), encompassing 3942 unique molecules. Three different molecular representations were obtained, including electrostatic potential (ESP) maps, molecular graphs, and tabular features (extracted from ESP maps and tabular Mordred descriptors). We conducted 3942 DFT calculations to acquire ESP maps and extract features from them. Subsequently, we applied two deep learning models, EdgeConv and Graph Convolutional Network (GCN), to the point cloud (ESP) and graph modalities of molecules. In addition, we utilized a random forest-based feature selection on tabular features, followed by mapping with XGBoost. A t-SNE analysis visualized chemical space across datasets and unique molecules, providing valuable insights for model evaluation. The proposed machine learning (ML)-based models, trained on 80% of each dataset and evaluated on the remaining 20%, showcased superior performance, particularly with XGBoost utilizing the extracted and selected tabular features. This yielded average test-set mean absolute error (MAE), root mean squared error (RMSE), and R-squared (R²) values of 0.458, 0.613, and 0.918, respectively. Furthermore, an ensemble of the three models showed improvement in error metrics across all datasets, consistently outperforming each individual model. This ensemble model was also tested on the Solubility Challenge 2019, achieving an RMSE of 0.865 and outperforming 37 models with an average RMSE of 1.62. Transferability analysis of our work further indicated robust performance across different datasets. Additionally, SHAP explainability for the feature-based XGBoost model provided transparency in solubility predictions, enhancing the interpretability of the results.

Three ML models and their ensemble predict aqueous solubility of small organic molecules using different representations: GCN with molecular graphs, EdgeConv with ESP maps, and XGBoost with tabular features from ESP and Mordred descriptors.
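For readers unfamiliar with the error metrics quoted in the abstract, the following is a minimal sketch of how MAE, RMSE, and R² can be computed, together with one simple way of combining several models' predictions. The abstract does not say how the ensemble was formed, so the unweighted mean used here is an assumption, and the function names and sample values are hypothetical illustrations rather than the authors' code or data.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error between observed and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted values."""
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def ensemble_mean(*model_preds):
    """Unweighted per-molecule average of several models' predictions.

    Assumption: simple averaging; the source does not specify the scheme.
    """
    return [sum(values) / len(values) for values in zip(*model_preds)]


if __name__ == "__main__":
    # Hypothetical log-solubility values, for illustration only.
    observed = [-2.0, -3.5, -1.0, -4.2]
    model_a = [-2.3, -3.1, -1.4, -4.0]
    model_b = [-1.8, -3.9, -0.7, -4.6]
    model_c = [-2.1, -3.4, -1.2, -3.9]

    combined = ensemble_mean(model_a, model_b, model_c)
    print("MAE :", mae(observed, combined))
    print("RMSE:", rmse(observed, combined))
    print("R2  :", r2(observed, combined))
```

Averaging is only one option; a weighted or stacked combination would slot into `ensemble_mean` without changing the metric functions.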
Pages: 20