Error assessment and optimal cross-validation approaches in machine learning applied to impurity diffusion

Cited by: 50

Authors:

Lu, Hai-Jin [1,2]

Zou, Nan [1]

Jacobs, Ryan [2]

Afflerbach, Ben [2]

Lu, Xiao-Gang [1,3]

Morgan, Dane [2]

Affiliations:

[1] Shanghai Univ, Sch Mat Sci & Engn, Shanghai 200072, Peoples R China

[2] Univ Wisconsin, Dept Mat Sci & Engn, Madison, WI 53706 USA

[3] Shanghai Univ, Mat Genome Inst, Shanghai 200072, Peoples R China

Source:

COMPUTATIONAL MATERIALS SCIENCE | 2019 / Vol. 169

Funding:

National Science Foundation (USA);

Keywords:

Machine learning; Diffusion; Gaussian process; Error assessment; ALLOYING ELEMENTS; COEFFICIENTS;

DOI:

10.1016/j.commatsci.2019.06.010

Chinese Library Classification number:

T [Industrial Technology];

Discipline classification code:

08 ;

Abstract:

Machine learning models have been widely utilized in materials science to discover trends in existing data and then make predictions to generate large databases, providing powerful tools for accelerating materials discovery and design. However, there is a significant need to refine approaches both for developing the best models and assessing the uncertainty in their predictions. In this work, we evaluate the performance of Gaussian kernel ridge regression (GKRR) and Gaussian process regression (GPR) for modeling ab-initio predicted impurity diffusion activation energies, using a database with 15 pure metal hosts and 408 host-impurity pairs. We demonstrate the advantages of basing the feature selection on minimizing the Leave-Group-Out (LOG) cross-validation (CV) root mean squared error (RMSE) instead of the more commonly used random K-fold CV RMSE. For the best descriptor and hyperparameter sets, the LOG RMSE from the GKRR (GPR) model is only 0.148 eV (0.155 eV) and the corresponding 5-fold RMSE is 0.116 eV (0.129 eV), demonstrating the model can effectively predict diffusion activation energies. We also show that the ab-initio impurity migration barrier can be employed as a feature to increase the accuracy of the model significantly while still yielding a significant speedup in the ability to predict the activation energy of new systems. Finally, we define r as the magnitude of the ratio of the actual error (residual) in a left-out data point during CV to the predicted standard deviation for that same data point in the GPR model, and compare the distribution of r to a normal distribution. Deviations of r from a normal distribution can be used to quantify the accuracy of the machine learning error estimates, and our results generally show that the approach yields accurate, normally-distributed error estimates for this diffusion data set.
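The two ideas in the abstract, Leave-Group-Out versus random K-fold CV RMSE and the ratio r of the CV residual to the GPR predicted standard deviation, can be illustrated with a minimal NumPy sketch. This is not the authors' code or data: the synthetic features, the six hypothetical "hosts" used as groups, the hyperparameters `alpha` and `gamma`, and all function names are assumptions made for illustration. The predictive mean is the Gaussian kernel ridge prediction, and the standard deviation is the GPR predictive one for the same kernel with unit signal variance and noise variance `alpha`.

```python
import numpy as np

def gaussian_kernel(A, B, gamma):
    """Gaussian (RBF) kernel matrix k(a, b) = exp(-gamma * ||a - b||^2)."""
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def fit_predict(X_tr, y_tr, X_te, alpha=1e-2, gamma=0.5):
    """Return predictive mean and standard deviation on X_te.

    alpha and gamma are illustrative values, not tuned hyperparameters.
    """
    mu = y_tr.mean()  # center targets so a zero-mean prior is reasonable
    K = gaussian_kernel(X_tr, X_tr, gamma) + alpha * np.eye(len(X_tr))
    K_s = gaussian_kernel(X_tr, X_te, gamma)
    mean = mu + K_s.T @ np.linalg.solve(K, y_tr - mu)
    # k(x, x) = 1 for this kernel; add the noise variance alpha
    var = 1.0 + alpha - np.sum(K_s * np.linalg.solve(K, K_s), axis=0)
    return mean, np.sqrt(np.maximum(var, 1e-12))

def leave_group_out_folds(groups):
    """One test fold per group: every point of that group is left out together."""
    return [np.flatnonzero(groups == g) for g in np.unique(groups)]

def random_kfold_folds(n, k, rng):
    """Random K-fold: shuffle the indices and split them into k test folds."""
    return np.array_split(rng.permutation(n), k)

def cross_validate(X, y, folds):
    """Return the CV RMSE and r = |residual| / predicted std for each point."""
    n = len(y)
    resid, std = np.empty(n), np.empty(n)
    for test in folds:
        train = np.setdiff1d(np.arange(n), test)
        mean, s = fit_predict(X[train], y[train], X[test])
        resid[test] = y[test] - mean
        std[test] = s
    return np.sqrt(np.mean(resid**2)), np.abs(resid) / std

# Synthetic stand-in data: 6 hypothetical hosts, 20 impurities each.
rng = np.random.default_rng(0)
n_hosts, n_per_host = 6, 20
groups = np.repeat(np.arange(n_hosts), n_per_host)
host_center = rng.normal(size=(n_hosts, 2))
X = host_center[groups] + 0.3 * rng.normal(size=(len(groups), 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + 0.05 * rng.normal(size=len(groups))

log_rmse, log_r = cross_validate(X, y, leave_group_out_folds(groups))
kfold_rmse, kfold_r = cross_validate(X, y, random_kfold_folds(len(y), 5, rng))

print(f"LOG RMSE:    {log_rmse:.3f}")
print(f"5-fold RMSE: {kfold_rmse:.3f}")
# For well-calibrated, normally distributed errors, roughly 68% of r values fall below 1.
print(f"fraction of r < 1 (LOG CV): {np.mean(log_r < 1.0):.2f}")
```

Comparing the empirical distribution of r with the folded standard normal, for example through the fraction of values below 1 or 2, is one simple way to check whether the predicted standard deviations are reasonable error estimates.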


Pages: 9
