Predicting Faults in High Performance Computing Systems: An In-Depth Survey of the State-of-the-Practice

Cited by: 22
Authors
Jauk, David [1]
Yang, Dai [2]
Schulz, Martin [2]
Affiliations
[1] Tech Univ Munich, Dept Informat, Garching, Germany
[2] Tech Univ Munich, Chair Comp Architecture & Parallel Syst, Garching, Germany
Keywords
High Performance Computing; Fault Prediction; Resilience; Exascale Computing; BAYESIAN SERIAL REVISION; FAILURE PREDICTION; CLUSTER
DOI
10.1145/3295500.3356185
Chinese Library Classification (CLC) number
TP301 [Theory, Methods];
Discipline classification code
081202;
Abstract
As we near exascale, resilience remains a major technical hurdle. Any technique with the goal of achieving resilience suffers from having to be reactive, as failures can appear at any time. A wide body of research aims at predicting failures, i.e., forecasting failures so that evasive actions can be taken while the system is still fully functional, which has the benefit of giving insight into the global system state. This research area has grown very diverse with a large number of approaches, yet is currently poorly classified, making it hard to understand the impact and coverage of existing work. In this paper, we perform an extensive survey of existing literature in failure prediction by analyzing and comparing more than 30 different failure prediction approaches. We develop a taxonomy, which aids in categorizing the methods, and we show how this can help us to understand the state-of-the-practice of this field and to identify opportunities, gaps as well as future work.
Pages: 13
Related papers
50 in total
  • [21] An In-Depth Empirical Investigation of State-of-the-Art Scheduling Approaches for Cloud Computing
    Ibrahim, Muhammad
    Nabi, Said
    Baz, Abdullah
    Alhakami, Hosam
    Raza, Muhammad Summair
    Hussain, Altaf
    Salah, Khaled
    Djemame, Karim
    IEEE ACCESS, 2020, 8 (08) : 128282 - 128294
  • [22] Sensing and Nondestructive Testing Applications of Terahertz Spectroscopy and Imaging Systems: State-of-the-Art and State-of-the-Practice
    Nsengiyumva, Walter
    Zhong, Shuncong
    Zheng, Longhui
    Liang, Wei
    Wang, Bing
    Huang, Yi
    Chen, Xuefeng
    Shen, Yaochun
    IEEE TRANSACTIONS ON INSTRUMENTATION AND MEASUREMENT, 2023, 72
  • [23] State-of-the-Practice Survey: United States Departments of Transportation Worker Injuries and Safety Program Efforts
    Marji, Lana K.
    Zech, Wesley C.
    Kirby, Jason T.
    SAFETY, 2023, 9 (04)
  • [24] Performance modeling and prediction of parallel and distributed computing systems: A survey of the state of the art
    Pllana, Sabri
    Brandic, Ivona
    Benkner, Siegfried
    CISIS 2007: FIRST INTERNATIONAL CONFERENCE ON COMPLEX, INTELLIGENT AND SOFTWARE INTENSIVE SYSTEMS, PROCEEDINGS, 2007 : 279 - +
  • [26] Performance Analysis of ZF and MMSE Equalizers for MIMO Systems: An In-Depth Study of the High SNR Regime
    Jiang, Yi
    Varanasi, Mahesh K.
    Li, Jian
    IEEE TRANSACTIONS ON INFORMATION THEORY, 2011, 57 (04) : 2008 - 2026
  • [27] A Survey of Faults and Fault-Injection Techniques in Edge Computing Systems
    Pourreza, Maryam
    Narasimhan, Priya
    2023 IEEE INTERNATIONAL CONFERENCE ON EDGE COMPUTING AND COMMUNICATIONS, EDGE, 2023 : 63 - 71
  • [28] Application of Structural Control Systems for the Cables of Cable-Stayed Bridges: State-of-the-Art and State-of-the-Practice
    Ahad Javanmardi
    Khaled Ghaedi
    Fuyun Huang
    Muhammad Usman Hanif
    Alireza Tabrizikahou
    Archives of Computational Methods in Engineering, 2022, 29 : 1611 - 1641
  • [29] Application of Structural Control Systems for the Cables of Cable-Stayed Bridges: State-of-the-Art and State-of-the-Practice
    Javanmardi, Ahad
    Ghaedi, Khaled
    Huang, Fuyun
    Hanif, Muhammad Usman
    Tabrizikahou, Alireza
    ARCHIVES OF COMPUTATIONAL METHODS IN ENGINEERING, 2022, 29 (03) : 1611 - 1641
  • [30] What do malware analysts want from academia? A survey on the state-of-the-practice to guide research developments
    Botacin, Marcus
    PROCEEDINGS OF 27TH INTERNATIONAL SYMPOSIUM ON RESEARCH IN ATTACKS, INTRUSIONS AND DEFENSES, RAID 2024, 2024 : 77 - 96