Predicting Faults in High Performance Computing Systems: An In-Depth Survey of the State-of-the-Practice

Cited by: 22
Authors
Jauk, David [1]
Yang, Dai [2]
Schulz, Martin [2]
Affiliations
[1] Tech Univ Munich, Dept Informat, Garching, Germany
[2] Tech Univ Munich, Chair Comp Architecture & Parallel Syst, Garching, Germany
Keywords
High Performance Computing; Fault Prediction; Resilience; Exascale Computing; BAYESIAN SERIAL REVISION; FAILURE PREDICTION; CLUSTER
DOI
10.1145/3295500.3356185
Chinese Library Classification (CLC) number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
As we near exascale, resilience remains a major technical hurdle. Any technique with the goal of achieving resilience suffers from having to be reactive, as failures can appear at any time. A wide body of research aims at predicting failures, i.e., forecasting failures so that evasive actions can be taken while the system is still fully functional, which has the benefit of giving insight into the global system state. This research area has grown very diverse with a large number of approaches, yet is currently poorly classified, making it hard to understand the impact and coverage of existing work. In this paper, we perform an extensive survey of existing literature in failure prediction by analyzing and comparing more than 30 different failure prediction approaches. We develop a taxonomy, which aids in categorizing the methods, and we show how this can help us to understand the state-of-the-practice of this field and to identify opportunities, gaps as well as future work.
Pages: 13