On the Development of Descriptor-Based Machine Learning Models for Thermodynamic Properties: Part 2-Applicability Domain and Outliers

Cited by: 0

Authors:

Trinh, Cindy [1]

Lasala, Silvia [1]

Herbinet, Olivier [1]

Meimaroglou, Dimitrios [1]

Affiliation:

[1] Univ Lorraine, CNRS, LRGP, F-54001 Nancy, France

Source:

ALGORITHMS | 2023 / Vol. 16 / Issue 12

Keywords:

machine learning; QSPR/QSAR; high-dimensional data; descriptors; thermodynamic properties; applicability domain; outlier detection; SIMULTANEOUS VARIABLE SELECTION; QUANTITATIVE STRUCTURE-ACTIVITY; APPLICABILITY DOMAIN; QSAR MODELS; IDENTIFICATION; CHEMISTRY; SPACE; SET;

DOI:

10.3390/a16120573

Chinese Library Classification:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

This article investigates the applicability domain (AD) of machine learning (ML) models trained on high-dimensional data, for the prediction of the ideal gas enthalpy of formation and entropy of molecules via descriptors. The AD is crucial as it describes the space of chemical characteristics in which the model can make predictions with a given reliability. This work studies the AD definition of an ML model throughout its development procedure: during data preprocessing, model construction and model deployment. Three AD definition methods, commonly used for outlier detection in high-dimensional problems, are compared: isolation forest (iForest), random forest prediction confidence (RF confidence) and k-nearest neighbors in the 2D projection of descriptor space obtained via t-distributed stochastic neighbor embedding (tSNE2D/kNN). These methods compute an anomaly score that can be used instead of the distance metrics of classical low-dimensional AD definition methods, the latter being generally unsuitable for high-dimensional problems. Typically, a molecule is considered to lie within the AD if its distance from the training domain (in low-dimensional problems) or its anomaly score (in high-dimensional problems) is below a given threshold. During data preprocessing, the three AD definition methods are used to identify outlier molecules and the effect of their removal is investigated. A more significant improvement of model performance is observed when outliers identified with RF confidence are removed (e.g., for a removal of 30% of outliers, the MAE (Mean Absolute Error) of the test dataset is divided by 2.5, 1.6 and 1.1 for RF confidence, iForest and tSNE2D/kNN, respectively). While these three methods identify X-outliers, the effect of other types of outliers, namely Model-outliers and y-outliers, is also investigated. In particular, the elimination of X-outliers followed by that of Model-outliers divides the MAE and RMSE (Root Mean Square Error) by 2 and 3, respectively, while reducing overfitting. The elimination of y-outliers does not have a significant effect on model performance. During model construction and deployment, the AD serves to verify the position of the test data and of different categories of molecules with respect to the training data, and to associate this position with their prediction accuracy. For data that are found to be close to the training data according to RF confidence but display high prediction errors, tSNE 2D representations are used to identify the possible sources of these errors (e.g., representation of the chemical information in the training data).
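The threshold rule described in the abstract (a molecule lies within the AD if its anomaly score is below a given threshold) can be illustrated with a minimal sketch of a kNN-based score on 2D coordinates. This is not the authors' implementation. It assumes the 2D coordinates already exist (here they are synthetic Gaussian points standing in for a tSNE projection), that the anomaly score is the mean Euclidean distance to the k nearest training points, and that the threshold is a quantile of the leave-one-out training scores. The function names, `k=5` and `q=0.95` are all hypothetical choices made for illustration.

```python
import math
import random


def knn_anomaly_score(point, reference, k=5):
    """Mean Euclidean distance from `point` to its k nearest reference points."""
    dists = sorted(math.dist(point, r) for r in reference)
    return sum(dists[:k]) / k


def training_scores(train, k=5):
    """Leave-one-out scores: each training point is scored against the others."""
    scores = []
    for i, p in enumerate(train):
        others = train[:i] + train[i + 1:]
        scores.append(knn_anomaly_score(p, others, k))
    return scores


def threshold_from_quantile(scores, q=0.95):
    """Assumed rule: use the q-quantile of the training scores as the AD limit."""
    ordered = sorted(scores)
    idx = min(len(ordered) - 1, int(q * len(ordered)))
    return ordered[idx]


def in_domain(point, train, threshold, k=5):
    """A point is inside the AD if its anomaly score does not exceed the threshold."""
    return knn_anomaly_score(point, train, k) <= threshold


# Synthetic stand-in for the 2D projection of the training descriptors.
random.seed(0)
train = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(200)]
threshold = threshold_from_quantile(training_scores(train), q=0.95)

near = (0.1, -0.2)  # close to the bulk of the training points
far = (8.0, 8.0)    # far outside the training cloud

print("near point inside AD:", in_domain(near, train, threshold))
print("far point inside AD:", in_domain(far, train, threshold))
```

The same thresholding step would apply to any other anomaly score, such as one from an isolation forest. Only `knn_anomaly_score` would need to change.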


Pages: 46

