Exploring Properties and Correlations of Fatal Events in a Large-Scale HPC System

Cited by: 12
Authors
Di, Sheng [1]
Guo, Hanqi [1]
Gupta, Rinku [1]
Pershey, Eric R. [1]
Snir, Marc [2]
Cappello, Franck [1]
Affiliations
[1] Argonne Natl Lab, MCS, Argonne, IL 60439 USA
[2] Univ Illinois, Dept Comp Sci, Champaign, IL 61820 USA
Keywords
Peta-scale supercomputer; mining correlations; fatal event analysis; reliability-availability-serviceability (RAS); FAILURES;
DOI
10.1109/TPDS.2018.2864184
Chinese Library Classification number
TP301 [Theory, methods];
Subject classification code
081202;
Abstract
In this paper, we explore potential correlations of fatal system events for one of the most powerful supercomputers, the IBM Blue Gene/Q Mira deployed at Argonne National Laboratory, based on its 5-year reliability, availability, and serviceability (RAS) log. Our contribution is two-fold. (1) We design an efficient log analysis tool, namely LogAider, with a novel filtering method that effectively extracts fatal events from masses of heavily duplicated system messages in the log. LogAider detects temporal correlation very precisely, with a high similarity (up to 95 percent) to the ground truth (i.e., the failure records reported by the administrators). The filtering reduces the original 2.6 million duplicated fatal messages to about 1,255 fatal events. (2) We analyze the 5-year RAS log of the Mira system using LogAider and summarize six important "takeaways" that can help system vendors and administrators better understand an extreme-scale system's fatal events. Specifically, we find that the distribution or proportion of the fatal system events follows a Pareto-like principle in general. The temporal correlation among fatal events is much stronger than that of warn messages and info messages, and the correlated events tend to constitute a few clusters. The mean time between fatal events (MTBFE) of the Mira system is about 1.3 days from the perspective of the system, and the mean time to interruption (MTTI) is 2-4 days from the perspective of users. The most error-prone item value with respect to any key attribute is likely to appear in the log every 2-10 days. Weibull, Gamma, and Pearson6 are the three best-fit distributions for the fatal event intervals. The overall correlation of fatal events on the 5D torus network is not prominent, whereas the small-region locality correlation (e.g., the fatal events inside racks) is relatively strong.
We believe our work will be interesting to large-scale HPC system administrators and vendors and to fault tolerance researchers, enabling them to better understand fatal events and mitigate such events accordingly.
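The abstract's two core quantities, the number of distinct fatal events left after filtering duplicated messages and the mean time between them, can be illustrated with a minimal sketch. This is not LogAider's actual filtering method, which the abstract does not specify; it assumes a simple sliding time window per message type. The function names, the 20-minute window, and the message IDs `"A"` and `"B"` are all hypothetical.

```python
from datetime import datetime, timedelta


def filter_events(messages, window=timedelta(minutes=20)):
    """Collapse bursts of duplicated messages into single events.

    `messages` is an iterable of (timestamp, msg_id) pairs. A message starts
    a new event only if no message with the same msg_id was seen within
    `window` before it. The window value is an arbitrary assumption.
    """
    last_seen = {}  # msg_id -> timestamp of the most recent message
    events = []
    for ts, msg_id in sorted(messages):
        prev = last_seen.get(msg_id)
        if prev is None or ts - prev > window:
            events.append((ts, msg_id))
        # Update on every message so a long burst keeps extending itself.
        last_seen[msg_id] = ts
    return events


def mean_time_between(events):
    """Mean gap between consecutive events, or None for fewer than two."""
    times = sorted(ts for ts, _ in events)
    if len(times) < 2:
        return None
    # The mean of consecutive gaps equals the total span over the gap count.
    return (times[-1] - times[0]) / (len(times) - 1)


# Synthetic example data (not taken from the Mira log).
t0 = datetime(2015, 1, 1)
messages = [
    (t0, "A"),
    (t0 + timedelta(minutes=1), "A"),
    (t0 + timedelta(minutes=5), "A"),
    (t0 + timedelta(days=1), "B"),
    (t0 + timedelta(days=1, minutes=2), "B"),
    (t0 + timedelta(days=3), "A"),
]
events = filter_events(messages)
print(len(events), mean_time_between(events))
```

Here six raw messages collapse into three events, and the mean time between them is the three-day span divided by two gaps. A real analysis would also need to account for spatial attributes such as rack or node location, which this sketch ignores.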
Pages: 361-374
Page count: 14
Source
IEEE Transactions on Parallel and Distributed Systems, 2019, 30 (2): 361-374