Error assessment and optimal cross-validation approaches in machine learning applied to impurity diffusion

Cited by: 50
Authors
Lu, Hai-Jin [1 ,2 ]
Zou, Nan [1 ]
Jacobs, Ryan [2 ]
Afflerbach, Ben [2 ]
Lu, Xiao-Gang [1 ,3 ]
Morgan, Dane [2]
Affiliations
[1] Shanghai Univ, Sch Mat Sci & Engn, Shanghai 200072, Peoples R China
[2] Univ Wisconsin, Dept Mat Sci & Engn, Madison, WI 53706 USA
[3] Shanghai Univ, Mat Genome Inst, Shanghai 200072, Peoples R China
Funding
U.S. National Science Foundation;
Keywords
Machine learning; Diffusion; Gaussian process; Error assessment; ALLOYING ELEMENTS; COEFFICIENTS;
DOI
10.1016/j.commatsci.2019.06.010
Chinese Library Classification
T [Industrial Technology];
Discipline Code
08;
Abstract
Machine learning models have been widely utilized in materials science to discover trends in existing data and then make predictions to generate large databases, providing powerful tools for accelerating materials discovery and design. However, there is a significant need to refine approaches both for developing the best models and assessing the uncertainty in their predictions. In this work, we evaluate the performance of Gaussian kernel ridge regression (GKRR) and Gaussian process regression (GPR) for modeling ab-initio predicted impurity diffusion activation energies, using a database with 15 pure metal hosts and 408 host-impurity pairs. We demonstrate the advantages of basing the feature selection on minimizing the Leave-Group-Out (LOG) cross-validation (CV) root mean squared error (RMSE) instead of the more commonly used random K-fold CV RMSE. For the best descriptor and hyperparameter sets, the LOG RMSE from the GKRR (GPR) model is only 0.148 eV (0.155 eV) and the corresponding 5-fold RMSE is 0.116 eV (0.129 eV), demonstrating the model can effectively predict diffusion activation energies. We also show that the ab-initio impurity migration barrier can be employed as a feature to increase the accuracy of the model significantly while still yielding a significant speedup in the ability to predict the activation energy of new systems. Finally, we define r as the magnitude of the ratio of the actual error (residual) in a left-out data point during CV to the predicted standard deviation for that same data point in the GPR model, and compare the distribution of r to a normal distribution. Deviations of r from a normal distribution can be used to quantify the accuracy of the machine learning error estimates, and our results generally show that the approach yields accurate, normally-distributed error estimates for this diffusion data set.
Pages: 9
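To illustrate the cross-validation comparison described in the abstract, the sketch below contrasts random 5-fold CV RMSE with leave-group-out CV RMSE (grouping samples by host metal) for a Gaussian kernel ridge regression model. This is a minimal sketch, not the authors' code: the data are synthetic stand-ins for the ab-initio activation-energy database, and the descriptors, hyperparameters, and group sizes are arbitrary assumptions.

```python
# Minimal sketch (synthetic data, arbitrary hyperparameters): compare random 5-fold CV
# RMSE with leave-group-out (LOG) CV RMSE, grouping samples by host metal, for a
# Gaussian kernel ridge regression model (sklearn KernelRidge with an RBF kernel).
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import KFold, LeaveOneGroupOut, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_hosts, pairs_per_host, n_features = 15, 27, 6   # ~15 hosts, ~400 host-impurity pairs
X = rng.normal(size=(n_hosts * pairs_per_host, n_features))  # synthetic descriptors
groups = np.repeat(np.arange(n_hosts), pairs_per_host)       # group label = host metal
# Synthetic "activation energy" target with a host-dependent shift plus noise
y = (X @ rng.normal(size=n_features)
     + 0.3 * rng.normal(size=n_hosts)[groups]
     + 0.1 * rng.normal(size=len(groups)))

model = make_pipeline(StandardScaler(), KernelRidge(kernel="rbf", alpha=1e-2, gamma=0.1))

def cv_rmse(cv, **kwargs):
    """Mean RMSE over the folds of the given CV splitter."""
    scores = cross_val_score(model, X, y, cv=cv,
                             scoring="neg_root_mean_squared_error", **kwargs)
    return -scores.mean()

print(f"random 5-fold CV RMSE:   {cv_rmse(KFold(n_splits=5, shuffle=True, random_state=0)):.3f}")
print(f"leave-group-out CV RMSE: {cv_rmse(LeaveOneGroupOut(), groups=groups):.3f}")
```

Because leave-group-out withholds all impurity pairs from one host at a time, it probes extrapolation to unseen hosts and typically reports a larger RMSE than random k-fold CV, which is why the abstract recommends it for feature selection.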
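The sketch below shows one way to compute the r-statistic defined in the abstract: fit a Gaussian process regressor, collect the residual on each left-out point during leave-group-out CV, divide by the standard deviation the GPR predicts for that point, and compare the resulting distribution to a standard normal. Again, the data are synthetic and the kernel and hyperparameters are assumptions; the abstract defines r as the magnitude of the ratio, while this sketch keeps the sign so the comparison to N(0,1) is direct.

```python
# Minimal sketch (synthetic data, assumed kernel): residual / predicted-sigma ratios
# from leave-group-out CV with a Gaussian process regressor, compared to N(0,1).
import numpy as np
from scipy import stats
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel
from sklearn.model_selection import LeaveOneGroupOut

rng = np.random.default_rng(0)
n_hosts, pairs_per_host, n_features = 15, 27, 6
X = rng.normal(size=(n_hosts * pairs_per_host, n_features))  # synthetic descriptors
groups = np.repeat(np.arange(n_hosts), pairs_per_host)       # group label = host metal
y = X @ rng.normal(size=n_features) + 0.1 * rng.normal(size=len(groups))

kernel = 1.0 * RBF(length_scale=1.0) + WhiteKernel(noise_level=0.1)
gpr = GaussianProcessRegressor(kernel=kernel, normalize_y=True, random_state=0)

ratios = []
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups):
    gpr.fit(X[train_idx], y[train_idx])
    mean, std = gpr.predict(X[test_idx], return_std=True)  # GPR gives a per-point std dev
    ratios.append((y[test_idx] - mean) / std)               # residual / predicted sigma
r = np.concatenate(ratios)

# If the GPR uncertainty estimates are well calibrated, r should look standard normal.
ks = stats.kstest(r, "norm")
print(f"mean(r) = {r.mean():.2f}, std(r) = {r.std():.2f}")
print(f"KS test vs N(0,1): statistic = {ks.statistic:.3f}, p-value = {ks.pvalue:.3f}")
```

Systematic deviations of r from a standard normal (for example, a standard deviation well above 1) would indicate that the GPR error bars are miscalibrated, which is the diagnostic idea behind the error assessment described in the abstract.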