On the Development of Descriptor-Based Machine Learning Models for Thermodynamic Properties: Part 2-Applicability Domain and Outliers

Cited by: 0

Authors:

Trinh, Cindy [1]

Lasala, Silvia [1]

Herbinet, Olivier [1]

Meimaroglou, Dimitrios [1]

Affiliation:

[1] Univ Lorraine, CNRS, LRGP, F-54001 Nancy, France

Source:

ALGORITHMS | 2023 / Vol. 16 / Issue 12

Keywords:

machine learning; QSPR/QSAR; high-dimensional data; descriptors; thermodynamic properties; applicability domain; outlier detection; SIMULTANEOUS VARIABLE SELECTION; QUANTITATIVE STRUCTURE-ACTIVITY; APPLICABILITY DOMAIN; QSAR MODELS; IDENTIFICATION; CHEMISTRY; SPACE; SET

DOI:

10.3390/a16120573

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

This article investigates the applicability domain (AD) of machine learning (ML) models trained on high-dimensional data for the descriptor-based prediction of the ideal gas enthalpy of formation and entropy of molecules. The AD is crucial because it describes the space of chemical characteristics in which the model can make predictions with a given reliability. This work studies the AD definition of an ML model throughout its development procedure: during data preprocessing, model construction and model deployment. Three AD definition methods, commonly used for outlier detection in high-dimensional problems, are compared: isolation forest (iForest), random forest prediction confidence (RF confidence) and k-nearest neighbors in the 2D projection of descriptor space obtained via t-distributed stochastic neighbor embedding (tSNE2D/kNN). These methods compute an anomaly score that can be used in place of the distance metrics of classical low-dimensional AD definition methods, which are generally unsuitable for high-dimensional problems. Typically, a molecule is considered to lie within the AD if its distance from the training domain (in low-dimensional problems) or its anomaly score (in high-dimensional problems) is below a given threshold. During data preprocessing, the three AD definition methods are used to identify outlier molecules, and the effect of their removal is investigated. The largest improvement in model performance is observed when the outliers identified with RF confidence are removed (e.g., for a removal of 30% of outliers, the MAE (Mean Absolute Error) of the test dataset is divided by 2.5, 1.6 and 1.1 for RF confidence, iForest and tSNE2D/kNN, respectively). While these three methods identify X-outliers, the effect of other types of outliers, namely Model-outliers and y-outliers, is also investigated. In particular, eliminating X-outliers and then Model-outliers divides the MAE and the RMSE (Root Mean Square Error) by 2 and 3, respectively, while reducing overfitting. The elimination of y-outliers has no significant effect on model performance. During model construction and deployment, the AD serves to verify the position of the test data, and of different categories of molecules, with respect to the training data, and to associate this position with their prediction accuracy. For data that are close to the training data according to RF confidence but still display high prediction errors, tSNE 2D representations are used to identify the possible sources of these errors (e.g., the representation of the chemical information in the training data).
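The thresholding rule described in the abstract (a molecule lies within the AD when its anomaly score is below a given threshold) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the anomaly score is the mean Euclidean distance to the k nearest training points in a 2D space, uses synthetic grid coordinates as a stand-in for a tSNE projection, and sets the threshold at the 95th percentile of the training scores. The function names `knn_score`, `fit_threshold` and `in_domain`, and the values of `k` and `quantile`, are hypothetical choices made for illustration only.

```python
import math
from statistics import mean


def knn_score(point, reference, k=3, exclude_self=False):
    """Anomaly score: mean Euclidean distance from `point` to its k nearest
    points in `reference` (an assumed, simplified kNN score)."""
    dists = sorted(math.dist(point, r) for r in reference)
    if exclude_self:
        # `point` belongs to `reference`: drop its zero distance to itself.
        dists = dists[1:]
    return mean(dists[:k])


def fit_threshold(train, k=3, quantile=0.95):
    """Threshold taken as a quantile of the training scores (assumption)."""
    scores = sorted(knn_score(p, train, k, exclude_self=True) for p in train)
    idx = min(len(scores) - 1, math.ceil(quantile * len(scores)) - 1)
    return scores[idx]


def in_domain(point, train, threshold, k=3):
    """A point is inside the AD if its anomaly score does not exceed the threshold."""
    return knn_score(point, train, k) <= threshold


# Synthetic 2D coordinates standing in for a tSNE projection of descriptors.
train = [(x * 0.5, y * 0.5) for x in range(5) for y in range(5)]
threshold = fit_threshold(train, k=3, quantile=0.95)

near = (1.1, 0.9)    # sits among the training points
far = (10.0, 10.0)   # far from every training point

print("near in AD:", in_domain(near, train, threshold))
print("far in AD:", in_domain(far, train, threshold))
```

The same pattern applies to any anomaly score: only `knn_score` would change if, for example, an isolation forest score were used instead, while the threshold fitting and the domain check stay the same.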

Pages: 46
