Fault prediction under the microscope: A closer look into HPC systems

Cited by: 0
Authors
Gainaru, Ana [1]
Cappello, Franck [2]
Snir, Marc [3]
Kramer, William [4]
Affiliations
[1] UIUC, Dept Comp Sci, Urbana, IL 61801 USA
[2] INRIA, Paris, France
[3] MCS, ANL, Lemont, IL USA
[4] UIUC, NCSA, Urbana, IL USA
Keywords
fault tolerance; large-scale HPC systems; signal analysis; fault detection
DOI
Not available
Chinese Library Classification number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
A large percentage of computing capacity in today's large high-performance computing systems is wasted because of failures. Consequently, current research focuses on fault tolerance strategies that aim to minimize the effects of faults on applications. By far the most popular technique is the checkpoint-restart strategy. A complement to this classical approach is failure avoidance, in which the occurrence of a fault is predicted and preventive measures are taken. This requires a reliable prediction system that anticipates failures and their locations. Thus far, research in this field has used ideal predictors that were not implemented in real HPC systems. In this paper, we merge signal analysis concepts with data mining techniques to extend the ELSA (Event Log Signal Analyzer) toolkit and offer an adaptive and more efficient prediction module. Our goal is to provide models that characterize the normal behavior of a system and the way faults affect it. Being able to detect deviations from normality quickly is the foundation of accurate fault prediction. However, this is challenging because component failure dynamics are heterogeneous in space and time. To this end, a large part of the paper is devoted to a detailed analysis of the prediction method, applying it to two large-scale systems and investigating the characteristics and bottlenecks of each step of the prediction process. Furthermore, we analyze the impact of the prediction's precision and recall on current checkpointing strategies and highlight future improvements and directions for research in this field.
Pages: 11
Related papers
50 in total
  • [31] Admission systems to dental school in Europe: a closer look at Flanders
    Buyse, T.
    Lievens, F.
    Martens, L.
    EUROPEAN JOURNAL OF DENTAL EDUCATION, 2010, 14 (04) : 215 - 220
  • [32] Online Fault Classification in HPC Systems Through Machine Learning
    Netti, Alessio
    Kiziltan, Zeynep
    Babaoglu, Ozalp
    Sirbu, Alina
    Bartolini, Andrea
    Borghesi, Andrea
    EURO-PAR 2019: PARALLEL PROCESSING, 2019, 11725 : 3 - 16
  • [33] Closer Look at the Uncertainty Estimation in Semantic Segmentation under Distributional Shift
    Cygert, Sebastian
    Wroblewski, Bartlomiej
    Wozniak, Karol
    Slowinski, Radoslaw
    Czyzewski, Andrzej
    2021 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2021
  • [34] A Closer Look at Access Control in Multi-User Voice Systems
    Shafei, Hassan A.
    Tan, Chiu C.
    IEEE ACCESS, 2024, 12 : 40933 - 40946
  • [35] A Case for Epidemic Fault Detection and Group Membership in HPC Storage Systems
    Snyder, Shane
    Carns, Philip
    Jenkins, Jonathan
    Harms, Kevin
    Ross, Robert
    Mubarak, Misbah
    Carothers, Christopher
    HIGH PERFORMANCE COMPUTING SYSTEMS: PERFORMANCE MODELING, BENCHMARKING, AND SIMULATION, 2015, 8966 : 237 - 248
  • [36] Hierarchical Clustering Strategies for Fault Tolerance in Large Scale HPC Systems
    Bautista-Gomez, Leonardo
    Ropars, Thomas
    Maruyama, Naoya
    Cappello, Franck
    Matsuoka, Satoshi
    2012 IEEE INTERNATIONAL CONFERENCE ON CLUSTER COMPUTING (CLUSTER), 2012 : 355 - 363
  • [37] Is Bigger Better? A Closer Look at Small Health Systems in the United States
    Sherry, Tisamarie B.
    Damberg, Cheryl L.
    DeYoreo, Maria
    Bogart, Andy
    Agniel, Denis
    Ridgely, M. Susan
    Escarce, Jose J.
    MEDICAL CARE, 2022, 60 (07) : 504 - 511
  • [38] CoLoR: Co-Located Rescuers for Fault Tolerance in HPC Systems
    Hussain, Zaeem
    Cui, Xiaolong
    Znati, Taieb
    Melhem, Rami
    2018 IEEE 24TH INTERNATIONAL CONFERENCE ON PARALLEL AND DISTRIBUTED SYSTEMS (ICPADS 2018), 2018 : 569 - 576
  • [39] FASE: A framework for scalable performance prediction of HPC systems and applications
    Grobelny, Eric
    Bueno, David
    Troxel, Ian
    George, Alan D.
    Vetter, Jeffrey S.
    SIMULATION-TRANSACTIONS OF THE SOCIETY FOR MODELING AND SIMULATION INTERNATIONAL, 2007, 83 (10) : 721 - 745
  • [40] Fault prediction of power electronics modules and systems under complex working conditions
    Di, Yuan
    Jin, Chao
    Bagheri, Behrad
    Shi, Zhe
    Ardakani, Hossein Davari
    Tang, Zhijun
    Lee, Jay
    COMPUTERS IN INDUSTRY, 2018, 97 : 1 - 9