Fault prediction under the microscope: A closer look into HPC systems

Cited by: 0
Authors
Gainaru, Ana [1]
Cappello, Franck [2]
Snir, Marc [3]
Kramer, William [4]
Affiliations
[1] UIUC, Dept Comp Sci, Urbana, IL 61801 USA
[2] INRIA, Paris, France
[3] MCS, ANL, Lemont, IL USA
[4] UIUC, NCSA, Urbana, IL USA
Keywords
fault tolerance; large-scale HPC systems; signal analysis; fault detection
DOI
Not available
Chinese Library Classification number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
A large percentage of computing capacity in today's large high-performance computing systems is wasted because of failures. Consequently, current research is focusing on providing fault tolerance strategies that aim to minimize the effects of faults on applications. By far the most popular technique is the checkpoint-restart strategy. A complement to this classical approach is failure avoidance, by which the occurrence of a fault is predicted and preventive measures are taken. This requires a reliable prediction system to anticipate failures and their locations. Thus far, research in this field has used ideal predictors that were not implemented in real HPC systems. In this paper, we merge signal analysis concepts with data mining techniques to extend the ELSA (Event Log Signal Analyzer) toolkit and offer an adaptive and more efficient prediction module. Our goal is to provide models that characterize the normal behavior of a system and the way faults affect it. Being able to detect deviations from normality quickly is the foundation of accurate fault prediction. However, this is challenging because component failure dynamics are heterogeneous in space and time. To this end, a large part of the paper is focused on a detailed analysis of the prediction method, by applying it to two large-scale systems and by investigating the characteristics and bottlenecks of each step of the prediction process. Furthermore, we analyze the impact of the prediction's precision and recall on current checkpointing strategies and highlight future improvements and directions for research in this field.
Pages: 11
Related papers
50 in total
  • [21] Experimental Analysis in Hadoop MapReduce: A Closer Look at Fault Detection and Recovery Techniques
    Saadoon, Muntadher
    Hamid, Siti Hafizah Ab
    Sofian, Hazrina
    Altarturi, Hamza
    Nasuha, Nur
    Azizul, Zati Hakim
    Sani, Asmiza Abdul
    Asemi, Adeleh
    SENSORS, 2021, 21 (11)
  • [22] A closer look at single object tracking under variable haze
    Satbir Singh
    Nikhil Lamba
    Arun Khosla
    Multimedia Tools and Applications, 2024, 83 (38) : 85755 - 85780
  • [24] Time evolution in quantum systems: A closer look at student understanding
    Passante G.
    Schermerhorn B.P.
    Pollock S.J.
    Sadaghiani H.R.
    European Journal of Physics, 2020, 41 (01)
  • [25] INPUT TO OUTPUT - A CLOSER LOOK AT PLANT PROCESS MEASUREMENT SYSTEMS
    PANNELL, GL
    TRANSACTIONS OF THE AMERICAN NUCLEAR SOCIETY, 1983, 44 : 100 - 102
  • [26] Resource sharing in EDF-scheduled systems: a closer look
    Baruah, Sanjoy K.
    27TH IEEE INTERNATIONAL REAL-TIME SYSTEMS SYMPOSIUM, PROCEEDINGS, 2006, : 379 - 387
  • [27] Relationship of planning and control systems with strategic choices: A closer look
    Shih M.S.H.
    Yong L.-C.
    Asia Pacific Journal of Management, 2001, 18 (4) : 481 - 501
  • [28] Exploring energy saving opportunities in fault tolerant HPC systems
    Moran, Marina
    Balladini, Javier
    Rexachs, Dolores
    Rucci, Enzo
    JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING, 2024, 185
  • [29] Power and Performance of GPU-accelerated Systems: A Closer Look
    Abe, Yuki
    Sasaki, Hiroshi
    Kato, Shinpei
    Inoue, Koji
    Edahiro, Masato
    Peres, Martin
    2013 IEEE INTERNATIONAL SYMPOSIUM ON WORKLOAD CHARACTERIZATION (IISWC 2013), 2013, : 109 - +
  • [30] A machine learning approach to online fault classification in HPC systems
    Netti, Alessio
    Kiziltan, Zeynep
    Babaoglu, Ozalp
    Sirbu, Alina
    Bartolini, Andrea
    Borghesi, Andrea
    FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF ESCIENCE, 2020, 110 : 1009 - 1022