On the Development of Descriptor-Based Machine Learning Models for Thermodynamic Properties: Part 2-Applicability Domain and Outliers

Authors
Trinh, Cindy [1]
Lasala, Silvia [1]
Herbinet, Olivier [1]
Meimaroglou, Dimitrios [1]
Affiliations
[1] Univ Lorraine, CNRS, LRGP, F-54001 Nancy, France
Keywords
machine learning; QSPR/QSAR; high-dimensional data; descriptors; thermodynamic properties; applicability domain; outlier detection; SIMULTANEOUS VARIABLE SELECTION; QUANTITATIVE STRUCTURE-ACTIVITY; APPLICABILITY DOMAIN; QSAR MODELS; IDENTIFICATION; CHEMISTRY; SPACE; SET
DOI
10.3390/a16120573
Chinese Library Classification number
TP18 [Artificial Intelligence Theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
This article investigates the applicability domain (AD) of machine learning (ML) models trained on high-dimensional data for the prediction of the ideal gas enthalpy of formation and entropy of molecules via descriptors. The AD is crucial because it describes the space of chemical characteristics in which the model can make predictions with a given reliability. This work studies the AD definition of an ML model throughout its development procedure: during data preprocessing, model construction and model deployment. Three AD definition methods, commonly used for outlier detection in high-dimensional problems, are compared: isolation forest (iForest), random forest prediction confidence (RF confidence) and k-nearest neighbors in the 2D projection of descriptor space obtained via t-distributed stochastic neighbor embedding (tSNE2D/kNN). These methods compute an anomaly score that can be used instead of the distance metrics of classical low-dimensional AD definition methods, the latter being generally unsuitable for high-dimensional problems. Typically, a molecule is considered to lie within the AD if its distance from the training domain (in low-dimensional problems) or its anomaly score (in high-dimensional problems) is below a given threshold. During data preprocessing, the three AD definition methods are used to identify outlier molecules, and the effect of their removal is investigated. The largest improvement in model performance is observed when outliers identified with RF confidence are removed (e.g., for a removal of 30% of outliers, the MAE (Mean Absolute Error) of the test dataset is divided by 2.5, 1.6 and 1.1 for RF confidence, iForest and tSNE2D/kNN, respectively). While these three methods identify X-outliers, the effect of other types of outliers, namely Model-outliers and y-outliers, is also investigated. In particular, the elimination of X-outliers followed by that of Model-outliers divides the MAE and RMSE (Root Mean Square Error) by 2 and 3, respectively, while reducing overfitting. The elimination of y-outliers does not have a significant effect on model performance. During model construction and deployment, the AD serves to verify the position of the test data and of different categories of molecules with respect to the training data, and to associate this position with their prediction accuracy. For data that are found to be close to the training data according to RF confidence but display high prediction errors, tSNE 2D representations are used to identify the possible sources of these errors (e.g., representation of the chemical information in the training data).
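The anomaly-score thresholding described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes scikit-learn, uses synthetic data in place of molecular descriptors, and takes the standard deviation of per-tree predictions as a stand-in for RF confidence, which may differ from the paper's definition. The function names and the quantile-based threshold are hypothetical choices made for illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest, RandomForestRegressor


def iforest_scores(X, seed=0):
    """Isolation forest anomaly score per row; higher means more anomalous."""
    forest = IsolationForest(n_estimators=100, random_state=seed).fit(X)
    # score_samples returns higher values for inliers, so negate it.
    return -forest.score_samples(X)


def rf_spread_scores(X, y, seed=0):
    """Spread of per-tree predictions (assumed proxy for RF confidence)."""
    rf = RandomForestRegressor(n_estimators=100, random_state=seed).fit(X, y)
    per_tree = np.stack([tree.predict(X) for tree in rf.estimators_])
    return per_tree.std(axis=0)


def inside_domain(scores, outlier_fraction):
    """True where the anomaly score is at or below the quantile threshold."""
    threshold = np.quantile(scores, 1.0 - outlier_fraction)
    return scores <= threshold


# Synthetic stand-in for a high-dimensional descriptor matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
X[:5] += 8.0  # five synthetic points placed far from the rest
y = X[:, 0] + 0.1 * rng.normal(size=200)

scores = iforest_scores(X)
mask = inside_domain(scores, outlier_fraction=0.30)  # flag the top 30% as outliers
spread = rf_spread_scores(X, y)
```

Here `mask` marks the rows retained as lying inside the domain. The same `inside_domain` helper could be applied to `spread`, or to k-nearest-neighbor distances computed in a 2D t-SNE projection, to compare the three scoring approaches on equal footing.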
Pages: 46