On the Development of Descriptor-Based Machine Learning Models for Thermodynamic Properties: Part 2-Applicability Domain and Outliers

Authors
Trinh, Cindy [1]
Lasala, Silvia [1]
Herbinet, Olivier [1]
Meimaroglou, Dimitrios [1]
Affiliation
[1] Univ Lorraine, CNRS, LRGP, F-54001 Nancy, France
Keywords
machine learning; QSPR/QSAR; high-dimensional data; descriptors; thermodynamic properties; applicability domain; outlier detection; SIMULTANEOUS VARIABLE SELECTION; QUANTITATIVE STRUCTURE-ACTIVITY; APPLICABILITY DOMAIN; QSAR MODELS; IDENTIFICATION; CHEMISTRY; SPACE; SET;
DOI
10.3390/a16120573
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
This article investigates the applicability domain (AD) of machine learning (ML) models trained on high-dimensional data, for the prediction of the ideal gas enthalpy of formation and entropy of molecules via descriptors. The AD is crucial as it describes the space of chemical characteristics in which the model can make predictions with a given reliability. This work studies the AD definition of an ML model throughout its development procedure: during data preprocessing, model construction and model deployment. Three AD definition methods, commonly used for outlier detection in high-dimensional problems, are compared: isolation forest (iForest), random forest prediction confidence (RF confidence) and k-nearest neighbors in the 2D projection of descriptor space obtained via t-distributed stochastic neighbor embedding (tSNE2D/kNN). These methods compute an anomaly score that can be used instead of the distance metrics of classical low-dimensional AD definition methods, the latter being generally unsuitable for high-dimensional problems. Typically, a molecule is considered to lie within the AD if its distance from the training domain (in low-dimensional problems) or its anomaly score (in high-dimensional problems) is below a given threshold. During data preprocessing, the three AD definition methods are used to identify outlier molecules and the effect of their removal is investigated. The largest improvement in model performance is observed when outliers identified with RF confidence are removed (e.g., for a removal of 30% of outliers, the MAE (Mean Absolute Error) of the test dataset is divided by 2.5, 1.6 and 1.1 for RF confidence, iForest and tSNE2D/kNN, respectively). While these three methods identify X-outliers, the effect of other types of outliers, namely Model-outliers and y-outliers, is also investigated. In particular, the elimination of X-outliers followed by that of Model-outliers divides the MAE and the RMSE (Root Mean Square Error) by 2 and 3, respectively, while reducing overfitting. The elimination of y-outliers has no significant effect on model performance. During model construction and deployment, the AD serves to verify the position of the test data and of different categories of molecules with respect to the training data and to associate this position with their prediction accuracy. For data that are found to be close to the training data according to RF confidence, yet display high prediction errors, tSNE 2D representations are used to identify the possible sources of these errors (e.g., representation of the chemical information in the training data).
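The thresholding idea in the abstract (a molecule lies within the AD when its anomaly score is below a threshold) can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the molecules have already been projected to 2D coordinates (for instance by a t-SNE step performed elsewhere), uses the mean distance to the k nearest training points as the anomaly score, and takes the threshold as the largest score among the training points kept after outlier removal. The toy coordinates, the function names, k = 3 and the 10% removal fraction are all hypothetical choices made for illustration.

```python
import math

def knn_score(point, reference, k=3):
    """Anomaly score: mean Euclidean distance to the k nearest reference points."""
    dists = sorted(math.dist(point, r) for r in reference)
    return sum(dists[:k]) / k

def leave_one_out_scores(train, k=3):
    """Score each training point against all other training points."""
    return [knn_score(p, train[:i] + train[i + 1:], k) for i, p in enumerate(train)]

def top_fraction(scores, fraction):
    """Indices of the highest-scoring points, i.e. the candidate X-outliers."""
    n_out = round(fraction * len(scores))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(ranked[:n_out])

# Hypothetical 2D coordinates standing in for a projection of descriptor space.
train_2d = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (0.05, 0.05),
            (0.2, 0.1), (0.1, 0.2), (0.2, 0.2), (0.15, 0.15), (3.0, 3.0)]

# Preprocessing: flag the 10% of training points with the highest scores.
train_scores = leave_one_out_scores(train_2d, k=3)
outliers = top_fraction(train_scores, fraction=0.1)
kept = [p for i, p in enumerate(train_2d) if i not in outliers]

# Assumed threshold: the largest score among the retained training points.
threshold = max(s for i, s in enumerate(train_scores) if i not in outliers)

# Deployment: a new point is inside the AD if its score does not exceed the threshold.
queries = [(0.1, 0.05), (5.0, 5.0)]
in_domain = [knn_score(q, kept, k=3) <= threshold for q in queries]
```

In this toy example the isolated point at (3.0, 3.0) is the one flagged during preprocessing, and only the first query falls inside the domain. Swapping `knn_score` for another anomaly score, such as one derived from an isolation forest or from the spread of random forest predictions, would leave the thresholding logic unchanged.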
Pages: 46