Predicting Faults in High Performance Computing Systems: An In-Depth Survey of the State-of-the-Practice

Cited by: 22
Authors
Jauk, David [1]
Yang, Dai [2]
Schulz, Martin [2]
Affiliations
[1] Tech Univ Munich, Dept Informat, Garching, Germany
[2] Tech Univ Munich, Chair Comp Architecture & Parallel Syst, Garching, Germany
Keywords
High Performance Computing; Fault Prediction; Resilience; Exascale Computing; BAYESIAN SERIAL REVISION; FAILURE PREDICTION; CLUSTER;
DOI
10.1145/3295500.3356185
Chinese Library Classification number
TP301 [Theory, Methods];
Discipline classification code
081202;
Abstract
As we near exascale, resilience remains a major technical hurdle. Any technique with the goal of achieving resilience suffers from having to be reactive, as failures can appear at any time. A wide body of research aims at predicting failures, i.e., forecasting failures so that evasive actions can be taken while the system is still fully functional, which has the benefit of giving insight into the global system state. This research area has grown very diverse with a large number of approaches, yet is currently poorly classified, making it hard to understand the impact and coverage of existing work. In this paper, we perform an extensive survey of existing literature in failure prediction by analyzing and comparing more than 30 different failure prediction approaches. We develop a taxonomy, which aids in categorizing the methods, and we show how this can help us to understand the state-of-the-practice of this field and to identify opportunities, gaps as well as future work.
Pages: 13
Related papers
50 in total
  • [1] Toward an In-Depth Analysis of Multifidelity High Performance Computing Systems
    Shilpika, Shilpika
    Lusch, Bethany
    Emani, Murali
    Simini, Filippo
    Vishwanath, Venkatram
    Papka, Michael E.
    Ma, Kwan-Liu
    2022 22ND IEEE/ACM INTERNATIONAL SYMPOSIUM ON CLUSTER, CLOUD AND INTERNET COMPUTING (CCGRID 2022), 2022, : 716 - 725
  • [2] Criteria for modeling accuracy: A state-of-the-practice survey
    Hasselman, Timothy K.
    Coppolino, Robert N.
    Zimmerman, David C.
    Proceedings of the International Modal Analysis Conference - IMAC, 2000, 1 : 335 - 341
  • [3] Highway bridge inspection - State-of-the-practice survey
    Rolander, DD
    Phares, BM
    Graybeal, BA
    Moore, ME
    Washer, GA
    MAINTENANCE OF TRANSPORTATION PAVEMENTS AND STRUCTURES: MAINTENANCE, 2001, (1749): : 73 - 81
  • [4] Criteria for modeling accuracy: A state-of-the-practice survey
    Hasselman, TK
    Coppolino, RN
    Zimmerman, DC
    IMAC-XVIII: A CONFERENCE ON STRUCTURAL DYNAMICS, VOLS 1 AND 2, PROCEEDINGS, 2000, 4062 : 335 - 341
  • [5] IN-DEPTH LOOK AT PRACTICE PERFORMANCE
    Unknown
    VETERINARY ECONOMICS, 1979, 20 (03): : 24 - 28
  • [6] EMERGENCY MONITORING AND ASSESSMENT SYSTEMS - STATE-OF-THE-PRACTICE
    IACOVINO, JM
    TRANSACTIONS OF THE AMERICAN NUCLEAR SOCIETY, 1981, 39 : 722 - 723
  • [7] Original software component manufacturing: Survey of the state-of-the-practice
    Seppänen, V
    Helander, N
    Niemelä, E
    Komi-Sirviö, S
    PROCEEDINGS OF THE 27TH EUROMICRO CONFERENCE - 2001: A NET ODYSSEY, 2001, : 138 - 145
  • [8] Circular Business Processes in the State-of-the-Practice: A Survey Study
    van Engelenhoven, Tanja
    Kassahun, Ayalew
    Tekinerdogan, Bedir
    SUSTAINABILITY, 2021, 13 (23)
  • [9] Evaluating and quantifying segregation in asphalt pavement construction: A state-of-the-practice survey
    Shi, Jiachen
    Gong, Hongren
    Cong, Lin
    Liang, Haimei
    Ren, Minda
    CONSTRUCTION AND BUILDING MATERIALS, 2023, 383
  • [10] Containerization for High Performance Computing Systems: Survey and Prospects
    Zhou, Naweiluo
    Zhou, Huan
    Hoppe, Dennis
    IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, 2023, 49 (04) : 2722 - 2740