Fault Modeling of Extreme Scale Applications using Machine Learning

Cited by: 14
Authors
Vishnu, Abhinav [1]
van Dam, Hubertus [2]
Tallent, Nathan R. [1]
Kerbyson, Darren J. [1]
Hoisie, Adolfy [1]
Affiliations
[1] Pacific Northwest Natl Lab, Richland, WA 99352 USA
[2] Brookhaven Natl Lab, Upton, NY 11973 USA
DOI
10.1109/IPDPS.2016.111
Chinese Library Classification
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Faults are commonplace in large-scale systems. These systems experience a variety of faults, such as transient, permanent, and intermittent faults. Multi-bit faults are typically not corrected by the hardware, resulting in an error. This paper attempts to answer an important question: given a multi-bit fault in main memory, will it result in an application error, so that a recovery algorithm should be invoked, or can it be safely ignored? We propose an application fault modeling methodology to answer this question. Given a fault signature (a set of attributes comprising system and application state), we use machine learning to create a model that predicts whether a multi-bit permanent or transient main memory fault is likely to result in an error. We present the design elements: the fault injection methodology for covering important data structures, the application and system attributes that should be used for learning the model, the supervised learning algorithms (and potentially ensembles), and important metrics. We use three applications (NWChem, LULESH, and SVM) as examples to demonstrate the effectiveness of the proposed fault modeling methodology.
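The workflow the abstract describes (inject a multi-bit fault, record a fault signature, label whether an error resulted, then learn a predictor) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the toy `kernel` (a mean over a list) stands in for a real application, the signature holds only the flipped bit positions and the element index, the error criterion is a relative tolerance on the output, and a one-feature decision stump stands in for the supervised learning algorithms the authors evaluate.

```python
import random
import struct


def flip_bits(value, bit_positions):
    """Flip the given bits (0 = least significant) of a float's IEEE 754 double encoding."""
    (raw,) = struct.unpack("<Q", struct.pack("<d", value))
    for b in bit_positions:
        raw ^= 1 << b
    (out,) = struct.unpack("<d", struct.pack("<Q", raw))
    return out


def kernel(data):
    """Hypothetical stand-in for an application: the mean of the data."""
    return sum(data) / len(data)


def run_trial(data, rng, n_bits=2, tol=1e-6):
    """Inject one multi-bit fault and return (signature, is_error)."""
    golden = kernel(data)
    idx = rng.randrange(len(data))
    bits = sorted(rng.sample(range(64), n_bits))
    faulty = list(data)
    faulty[idx] = flip_bits(faulty[idx], bits)
    result = kernel(faulty)
    # Written so that NaN and infinite results also count as errors.
    is_error = not (abs(result - golden) <= tol * max(1.0, abs(golden)))
    signature = {"index": idx, "lowest_bit": bits[0], "highest_bit": bits[-1]}
    return signature, is_error


def fit_stump(samples):
    """Pick the threshold t that best predicts 'error if highest_bit >= t'."""
    best_t, best_acc = 0, -1.0
    for t in range(65):
        correct = sum((sig["highest_bit"] >= t) == err for sig, err in samples)
        acc = correct / len(samples)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc


if __name__ == "__main__":
    rng = random.Random(0)
    data = [1.0 + rng.random() for _ in range(1000)]
    samples = [run_trial(data, rng) for _ in range(2000)]
    threshold, accuracy = fit_stump(samples)
    print(f"predict error when highest flipped bit >= {threshold} "
          f"(training accuracy {accuracy:.3f})")
```

In this toy setting, flips confined to low-order mantissa bits tend to be masked by the averaging, while flips that reach high mantissa, exponent, or sign bits tend to change the output. A real study would use a much richer signature and held-out evaluation rather than training accuracy.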
Pages: 222-231
Number of pages: 10