Fault Modeling of Extreme Scale Applications using Machine Learning

Cited by: 14
Authors
Vishnu, Abhinav [1]
van Dam, Hubertus [2]
Tallent, Nathan R. [1]
Kerbyson, Darren J. [1]
Hoisie, Adolfy [1]
Affiliations
[1] Pacific Northwest Natl Lab, Richland, WA 99352 USA
[2] Brookhaven Natl Lab, Upton, NY 11973 USA
DOI
10.1109/IPDPS.2016.111
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Faults are commonplace in large-scale systems. These systems experience a variety of faults, such as transient, permanent, and intermittent faults. Multi-bit faults are typically not corrected by the hardware, resulting in an error. This paper attempts to answer an important question: given a multi-bit fault in main memory, will it result in an application error, so that a recovery algorithm should be invoked, or can it be safely ignored? We propose an application fault modeling methodology to answer this question. Given a fault signature (a set of attributes comprising system and application state), we use machine learning to create a model that predicts whether a multi-bit permanent or transient main memory fault is likely to result in an error. We present the design elements, such as the fault injection methodology for covering important data structures, the application and system attributes that should be used for learning the model, the supervised learning algorithms (and potentially ensembles), and important metrics. We use three applications (NWChem, LULESH, and SVM) as examples to demonstrate the effectiveness of the proposed fault modeling methodology.
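The labeling step the abstract describes (inject a multi-bit fault, then record whether it leads to an application error) can be illustrated with a minimal sketch. This is not the paper's implementation: the toy kernel (a plain sum), the relative tolerance, the signature fields, and the names `flip_bits`, `causes_error`, and `build_samples` are all assumptions chosen for illustration.

```python
import math
import struct


def flip_bits(value, bit_positions):
    """Flip the given bits (0 = least significant) of a float's IEEE 754 binary64 encoding."""
    (raw,) = struct.unpack("<Q", struct.pack("<d", value))
    for pos in bit_positions:
        raw ^= 1 << pos
    (faulty,) = struct.unpack("<d", struct.pack("<Q", raw))
    return faulty


def causes_error(data, index, bit_positions, tolerance=1e-6):
    """Inject a multi-bit fault into data[index] and rerun a toy kernel (a sum).

    Assumption: an "error" means the output is non-finite or deviates from
    the fault-free output by more than a relative tolerance.
    """
    golden = sum(data)
    corrupted = list(data)
    corrupted[index] = flip_bits(corrupted[index], bit_positions)
    result = sum(corrupted)
    if not math.isfinite(result):
        return True
    return abs(result - golden) > tolerance * abs(golden)


def build_samples(data, faults, tolerance=1e-6):
    """Pair a hypothetical fault signature with its error label for each injected fault."""
    samples = []
    for index, bits in faults:
        signature = {
            "index": index,                  # which element was hit
            "bits": tuple(bits),             # which bits were flipped
            "magnitude": abs(data[index]),   # application state at the fault site
        }
        samples.append((signature, causes_error(data, index, bits, tolerance)))
    return samples


if __name__ == "__main__":
    data = [1.0, 2.0, 3.0, 4.0]
    # Two low mantissa bits versus two high exponent bits of the same element.
    for signature, label in build_samples(data, [(0, [0, 1]), (0, [61, 62])]):
        print(signature, "error" if label else "benign")
```

Labeled pairs of this kind are what a supervised learner would be trained on. Which attributes belong in the signature, and which learner to use, are the design questions the paper itself addresses.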
Pages: 222-231
Page count: 10