Predicting Faults in High Performance Computing Systems: An In-Depth Survey of the State-of-the-Practice

Cited by: 22
Authors
Jauk, David [1]
Yang, Dai [2]
Schulz, Martin [2]
Affiliations
[1] Tech Univ Munich, Dept Informat, Garching, Germany
[2] Tech Univ Munich, Chair Comp Architecture & Parallel Syst, Garching, Germany
Keywords
High Performance Computing; Fault Prediction; Resilience; Exascale Computing; BAYESIAN SERIAL REVISION; FAILURE PREDICTION; CLUSTER
DOI
10.1145/3295500.3356185
Chinese Library Classification number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
As we near exascale, resilience remains a major technical hurdle. Any technique with the goal of achieving resilience suffers from having to be reactive, as failures can appear at any time. A wide body of research aims at predicting failures, i.e., forecasting failures so that evasive actions can be taken while the system is still fully functional, which has the benefit of giving insight into the global system state. This research area has grown very diverse with a large number of approaches, yet is currently poorly classified, making it hard to understand the impact and coverage of existing work. In this paper, we perform an extensive survey of existing literature in failure prediction by analyzing and comparing more than 30 different failure prediction approaches. We develop a taxonomy, which aids in categorizing the methods, and we show how this can help us to understand the state-of-the-practice of this field and to identify opportunities, gaps as well as future work.
Pages: 13