Prediction of organic compound aqueous solubility using machine learning: a comparison study of descriptor-based and fingerprints-based models

Cited by: 17
Authors
Tayyebi, Arash [1]
Alshami, Ali S. [1]
Rabiei, Zeinab [2]
Yu, Xue [3]
Ismail, Nadhem [1]
Talukder, Musabbir Jahan [1]
Power, Jason [4]
Affiliations
[1] Univ North Dakota, Chem Engn, Grand Forks, ND 58201 USA
[2] Univ North Dakota, Chem Dept, Grand Forks, ND 58202 USA
[3] Univ North Dakota, Energy & Environm Res Ctr, Grand Forks, ND 58202 USA
[4] Univ North Dakota, Biomed Sci, Grand Forks, ND 58202 USA
Keywords
Aqueous solubility; Fingerprint; Machine learning; Random forest; SHAP; ACTIVITY-COEFFICIENTS; INTRINSIC SOLUBILITY; QSPR; MOLECULES
DOI
10.1186/s13321-023-00752-6
Chinese Library Classification (CLC) number
O6 [Chemistry]
Discipline classification code
0703
Abstract
Reliable, practical determination of a chemical species' solubility in water continues to rely on empirical observations and exhaustive experimental studies alone. Predicting solubility in water with data-driven algorithms can enable a rationally designed, efficient, and cost-effective tool for next-generation materials and chemical formulations. We present results from two machine learning (ML) modeling studies that predict the solubility of various species using data for over 8400 compounds. Two molecular representations were applied to produce water solubility estimates: molecular descriptors, the most widely used approach in previous studies, and Morgan fingerprints, a circular hash of each molecule's structure. We trained all models on 80% of the dataset using random forest (RF) as the regressor and tested prediction performance on the remaining 20%. The test-set coefficient of determination (R²) was 0.88 and 0.81, and the test-set root-mean-square error (RMSE) was 0.64 and 0.80, for the descriptor and circular fingerprint methods, respectively. We interpreted the resulting ML models and reported the features that most strongly affect aqueous solubility using SHapley Additive exPlanations (SHAP) and thermodynamic analysis. Low error, the ability to investigate molecular-level interactions, and compatibility with thermodynamic quantities distinguish the fingerprint method from other available computational tools. However, it is worth emphasizing that the physicochemical descriptor model outperformed the fingerprint model in predictive accuracy on the given test set.
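The evaluation protocol named in the abstract (an 80/20 train/test split scored by R² and RMSE) can be illustrated with a minimal, standard-library sketch. This is not the authors' code: the function names `split_80_20`, `rmse`, and `r2`, the fixed seed, and the toy solubility values are all assumptions made for illustration, and no model is trained here.

```python
import math
import random


def split_80_20(items, seed=0):
    """Shuffle a copy of `items` and split it 80% / 20% (hypothetical helper)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(0.8 * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


def rmse(y_true, y_pred):
    """Root-mean-square error between observed and predicted values."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy values only (made up for illustration, not data from the paper).
observed = [0.0, -1.0, -2.0, -3.0]
predicted = [0.0, -1.0, -2.0, -2.0]

train, test = split_80_20(range(10))
print(len(train), len(test))
print(rmse(observed, predicted), r2(observed, predicted))
```

In a full pipeline, the predictions would come from a regressor fitted on the training portion only, and both metrics would be computed on the held-out 20%.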
Pages: 16