Prediction of organic compound aqueous solubility using machine learning: a comparison study of descriptor-based and fingerprints-based models

Cited by: 17
Authors
Tayyebi, Arash [1]
Alshami, Ali S. [1]
Rabiei, Zeinab [2]
Yu, Xue [3]
Ismail, Nadhem [1]
Talukder, Musabbir Jahan [1]
Power, Jason [4]
Affiliations
[1] Univ North Dakota, Chem Engn, Grand Forks, ND 58201 USA
[2] Univ North Dakota, Chem Dept, Grand Forks, ND 58202 USA
[3] Univ North Dakota, Energy & Environm Res Ctr, Grand Forks, ND 58202 USA
[4] Univ North Dakota, Biomed Sci, Grand Forks, ND 58202 USA
Keywords
Aqueous solubility; Fingerprint; Machine learning; Random forest; SHAP; Activity coefficients; Intrinsic solubility; QSPR; Molecules
DOI
10.1186/s13321-023-00752-6
Chinese Library Classification
O6 [Chemistry]
Discipline classification code
0703
Abstract
Reliable, practical determination of a chemical species' solubility in water still relies solely on empirical observation and exhaustive experimental study. Predicting aqueous solubility with data-driven algorithms could provide a rationally designed, efficient, and cost-effective tool for developing next-generation materials and chemical formulations. We present results from two machine learning (ML) modeling studies that predict solubility using data for over 8400 compounds. Molecular descriptors, the most widely used representation in previous studies, and Morgan fingerprints, a circular hash of each molecule's structure, were used to produce water solubility estimates. We trained all models on 80% of the total dataset using random forest (RF) as the regressor and tested prediction performance on the remaining 20%, obtaining test coefficient of determination (R²) values of 0.88 and 0.81 and test root-mean-square error (RMSE) values of 0.64 and 0.80 for the descriptor and circular fingerprint methods, respectively. We interpreted the resulting ML models and reported the most influential features for aqueous solubility using SHapley Additive exPlanations (SHAP) and thermodynamic analysis. Low error, the ability to investigate molecular-level interactions, and compatibility with thermodynamic quantities distinguish the fingerprint method from other available computational tools. However, the physicochemical descriptor model achieved better predictive accuracy than the fingerprint model on the given test set.
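The abstract reports performance as R² and RMSE on a held-out 20% test split. As a minimal sketch of what those two numbers measure, the standard-library Python below implements an 80/20 shuffle split and both metrics. The function names, the seed, and the toy solubility values are illustrative assumptions, not the authors' code or data, and no random forest is trained here.

```python
import math
import random


def split_80_20(items, seed=0):
    """Shuffle a copy of `items` and return (train, test) with an 80/20 split."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = (len(shuffled) * 4) // 5  # integer arithmetic avoids float rounding
    return shuffled[:cut], shuffled[cut:]


def rmse(y_true, y_pred):
    """Root-mean-square error: sqrt of the mean squared residual."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


if __name__ == "__main__":
    # Hypothetical measured vs. predicted values (toy numbers for illustration only).
    measured = [-1.0, -2.0, -3.0, -4.0]
    predicted = [-1.5, -2.0, -2.5, -4.0]
    print("RMSE:", rmse(measured, predicted))
    print("R2:", r2_score(measured, predicted))
```

In a comparison like the one described, both representations would be scored with the same metrics on the same held-out compounds, so a lower RMSE and a higher R² for one model indicate better accuracy on that test set only.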
Pages: 16