Fault Modeling of Extreme Scale Applications using Machine Learning

Cited by: 14
Authors
Vishnu, Abhinav [1]
van Dam, Hubertus [2]
Tallent, Nathan R. [1]
Kerbyson, Darren J. [1]
Hoisie, Adolfy [1]
Affiliations
[1] Pacific Northwest Natl Lab, Richland, WA 99352 USA
[2] Brookhaven Natl Lab, Upton, NY 11973 USA
DOI
10.1109/IPDPS.2016.111
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology];
Discipline classification code
0812;
Abstract
Faults are commonplace in large-scale systems. These systems experience a variety of faults, such as transient, permanent, and intermittent faults. Multi-bit faults are typically not corrected by the hardware, resulting in an error. This paper attempts to answer an important question: given a multi-bit fault in main memory, will it result in an application error, so that a recovery algorithm should be invoked, or can it be safely ignored? We propose an application fault modeling methodology to answer this question. Given a fault signature (a set of attributes comprising system and application state), we use machine learning to create a model that predicts whether a multi-bit permanent or transient main memory fault is likely to result in an error. We present the design elements: the fault injection methodology for covering important data structures, the application and system attributes that should be used for learning the model, the supervised learning algorithms (and potentially ensembles), and important metrics. We use three applications (NWChem, LULESH, and SVM) as examples to demonstrate the effectiveness of the proposed fault modeling methodology.
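The labeling step behind such a model can be illustrated with a minimal sketch. It is not the authors' injection tool: the function names `flip_bits` and `causes_error`, the toy "application" (a plain sum over a list), and the relative tolerance are all assumptions made for illustration. The sketch flips several bits in the IEEE 754 encoding of one value and records whether the output moves beyond the tolerance; labels of this kind, paired with a fault signature, are what a supervised learner would train on.

```python
import math
import struct


def flip_bits(value: float, bit_positions) -> float:
    """Flip the given bits (0 = least significant) of a double's 64-bit encoding."""
    (raw,) = struct.unpack("<Q", struct.pack("<d", value))
    for pos in bit_positions:
        raw ^= 1 << pos
    (faulty,) = struct.unpack("<d", struct.pack("<Q", raw))
    return faulty


def causes_error(data, index, bit_positions, tolerance=1e-6):
    """Label one injected multi-bit fault (hypothetical helper).

    Returns True if the toy application's output (a sum) deviates from the
    fault-free result by more than a relative tolerance (an assumed threshold).
    """
    golden = sum(data)
    corrupted = list(data)
    corrupted[index] = flip_bits(corrupted[index], bit_positions)
    faulty = sum(corrupted)
    if math.isnan(faulty) or math.isinf(faulty):
        return True  # a non-finite output is treated as an error
    return abs(faulty - golden) > tolerance * max(1.0, abs(golden))


if __name__ == "__main__":
    data = [1.0, 2.0, 3.0]
    print(causes_error(data, 0, [0, 1]))    # two low mantissa bits
    print(causes_error(data, 0, [61, 62]))  # two high exponent bits
```

Faults in low-order mantissa bits tend to be masked by the tolerance, while faults in exponent or sign bits usually are not, which is the kind of distinction a fault signature is meant to capture.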
Pages: 222-231
Page count: 10