Prediction of organic compound aqueous solubility using machine learning: a comparison study of descriptor-based and fingerprints-based models

Cited by: 17
Authors
Tayyebi, Arash [1]
Alshami, Ali S. [1]
Rabiei, Zeinab [2]
Yu, Xue [3]
Ismail, Nadhem [1]
Talukder, Musabbir Jahan [1]
Power, Jason [4]
Affiliations
[1] Univ North Dakota, Chem Engn, Grand Forks, ND 58201 USA
[2] Univ North Dakota, Chem Dept, Grand Forks, ND 58202 USA
[3] Univ North Dakota, Energy & Environm Res Ctr, Grand Forks, ND 58202 USA
[4] Univ North Dakota, Biomed Sci, Grand Forks, ND 58202 USA
Keywords
Aqueous solubility; Fingerprint; Machine learning; Random forest; SHAP; ACTIVITY-COEFFICIENTS; INTRINSIC SOLUBILITY; QSPR; MOLECULES
DOI
10.1186/s13321-023-00752-6
Chinese Library Classification
O6 [Chemistry]
Discipline classification code
0703
Abstract
A reliable and practical determination of a chemical species' solubility in water continues to be examined using empirical observations and exhaustive experimental studies alone. Predictions of chemical solubility in water using data-driven algorithms can allow us to create a rationally designed, efficient, and cost-effective tool for next-generation materials and chemical formulations. We present results from two machine learning (ML) modeling studies to adequately predict various species' solubility using data for over 8400 compounds. Molecular descriptors, the most widely used method in previous studies, and Morgan fingerprints, a circular hash of the molecules' structures, were applied to produce water solubility estimates. We trained all models on 80% of the total dataset using the Random Forest (RF) technique as the regressor and tested prediction performance on the remaining 20%, resulting in test-set coefficient of determination (R²) values of 0.88 and 0.81 and root-mean-square error (RMSE) values of 0.64 and 0.80 for the descriptor and circular fingerprint methods, respectively. We interpreted the produced ML models and reported the most effective features for aqueous solubility measures using Shapley Additive exPlanations (SHAP) and thermodynamic analysis. Low error, the ability to investigate molecular-level interactions, and compatibility with thermodynamic quantities made the fingerprint method a distinct model compared to other available computational tools. However, it is worth emphasizing that the physicochemical descriptor model outperformed the fingerprint model in predictive accuracy on the given test set.
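The evaluation protocol described in the abstract (an 80/20 train/test split scored by R² and RMSE) can be illustrated with a minimal sketch. It uses only the Python standard library and does not reproduce the authors' Random Forest models, descriptors, or fingerprints. The function names and the toy values are hypothetical and are not data from the study.

```python
import math
import random


def split_80_20(items, seed=0):
    """Shuffle a copy of items and return (train, test) with an 80/20 split."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    cut = len(shuffled) * 8 // 10  # integer arithmetic avoids float rounding
    return shuffled[:cut], shuffled[cut:]


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def rmse(y_true, y_pred):
    """Root-mean-square error between observed and predicted values."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical observed and predicted solubility values (illustration only).
observed = [-1.0, -2.0, -3.0, -4.0]
predicted = [-1.5, -2.0, -2.5, -4.0]

train_ids, test_ids = split_80_20(range(10))
print(len(train_ids), len(test_ids))
print(round(r2_score(observed, predicted), 3))
print(round(rmse(observed, predicted), 3))
```

In a real workflow the predictions would come from a regressor fitted on the training portion only, and both metrics would be computed on the held-out 20%.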
Pages: 16