Prediction of organic compound aqueous solubility using machine learning: a comparison study of descriptor-based and fingerprints-based models

Cited by: 17
Authors
Tayyebi, Arash [1]
Alshami, Ali S. [1]
Rabiei, Zeinab [2]
Yu, Xue [3]
Ismail, Nadhem [1]
Talukder, Musabbir Jahan [1]
Power, Jason [4]
Affiliations
[1] Univ North Dakota, Chem Engn, Grand Forks, ND 58201 USA
[2] Univ North Dakota, Chem Dept, Grand Forks, ND 58202 USA
[3] Univ North Dakota, Energy & Environm Res Ctr, Grand Forks, ND 58202 USA
[4] Univ North Dakota, Biomed Sci, Grand Forks, ND 58202 USA
Keywords
Aqueous solubility; Fingerprint; Machine learning; Random forest; SHAP; ACTIVITY-COEFFICIENTS; INTRINSIC SOLUBILITY; QSPR; MOLECULES;
DOI
10.1186/s13321-023-00752-6
Chinese Library Classification number
O6 [Chemistry]
Discipline classification code
0703
Abstract
Reliable, practical determination of a chemical species' solubility in water still relies on empirical observation and exhaustive experimental study alone. Predicting aqueous solubility with data-driven algorithms can provide a rationally designed, efficient, and cost-effective tool for next-generation materials and chemical formulations. We present results from two machine learning (ML) modeling studies that predict solubility using data for over 8400 compounds. Molecular descriptors, the most widely used approach in previous studies, and Morgan fingerprints, a circular hash of each molecule's structure, were used to produce water solubility estimates. We trained all models on 80% of the dataset using random forest (RF) as the regressor and tested prediction performance on the remaining 20%, obtaining test coefficients of determination (R²) of 0.88 and 0.81 and test root-mean-square errors (RMSE) of 0.64 and 0.80 for the descriptor and circular fingerprint methods, respectively. We interpreted the resulting ML models and report the most influential features for aqueous solubility using SHapley Additive exPlanations (SHAP) and thermodynamic analysis. Low error, the ability to investigate molecular-level interactions, and compatibility with thermodynamic quantities distinguish the fingerprint method from other available computational tools. However, the physicochemical descriptor model outperformed the fingerprint model in predictive accuracy on the given test set.
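The evaluation protocol described in the abstract (an 80/20 train/test split scored with R² and RMSE) can be sketched in plain Python. This is only an illustration under stated assumptions, not the authors' code: the helper names `split_indices`, `rmse`, and `r2` and the toy solubility values are hypothetical, and the descriptor or fingerprint featurization and the random forest model itself are omitted.

```python
import math
import random


def split_indices(n, test_fraction=0.2, seed=0):
    """Shuffle indices 0..n-1 and return (train, test) index lists."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = round(n * test_fraction)
    return idx[n_test:], idx[:n_test]


def rmse(y_true, y_pred):
    """Root-mean-square error between observed and predicted values."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical observed vs. predicted solubility values (toy data only).
y_true = [-1.0, -2.0, -3.0, -4.0]
y_pred = [-1.5, -2.0, -2.5, -4.0]

train_idx, test_idx = split_indices(10)  # 8 training and 2 test indices
print(f"R2 = {r2(y_true, y_pred):.2f}, RMSE = {rmse(y_true, y_pred):.3f}")
```

In a full pipeline the same two metrics would be computed only on the held-out 20%, once for the descriptor-based model and once for the fingerprint-based model, which is how the abstract's paired values are reported.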
Pages: 16