Exploring Properties and Correlations of Fatal Events in a Large-Scale HPC System

Cited by: 12
Authors
Di, Sheng [1]
Guo, Hanqi [1]
Gupta, Rinku [1]
Pershey, Eric R. [1]
Snir, Marc [2]
Cappello, Franck [1]
Affiliations
[1] Argonne Natl Lab, MCS, Argonne, IL 60439 USA
[2] Univ Illinois, Dept Comp Sci, Champaign, IL 61820 USA
Keywords
Peta-scale supercomputer; mining correlations; fatal event analysis; reliability-availability-serviceability (RAS); FAILURES;
DOI
10.1109/TPDS.2018.2864184
Chinese Library Classification (CLC) number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
In this paper, we explore potential correlations of fatal system events for one of the most powerful supercomputers, the IBM Blue Gene/Q Mira deployed at Argonne National Laboratory, based on its 5-year reliability, availability, and serviceability (RAS) log. Our contribution is two-fold. (1) We design an efficient log analysis tool, namely LogAider, with a novel filtering method that effectively extracts fatal events from masses of heavily duplicated system messages in the log. LogAider detects temporal correlations precisely, with a high similarity (up to 95 percent) to the ground truth (i.e., the failure records reported by the administrators). The filtering reduces the original 2.6 million duplicated fatal messages to about 1,255 fatal events. (2) We analyze the 5-year RAS log of the Mira system using LogAider and summarize six important "takeaways" that can help system vendors and administrators better understand an extreme-scale system's fatal events. Specifically, we find that the distribution or proportion of the fatal system events follows a Pareto-like principle in general. The temporal correlation among fatal events is much stronger than that of warn messages and info messages, and the correlated events tend to constitute a few clusters. The mean time between fatal events (MTBFE) of the Mira system is about 1.3 days from the perspective of the system, and the mean time to interruption (MTTI) is 2-4 days from the perspective of users. The most error-prone item value with respect to any key attribute is likely to appear in the log every 2-10 days. Weibull, Gamma, and Pearson6 are the three best-fit distributions for the fatal event intervals. The overall correlation of fatal events on the 5D torus network is not prominent, whereas the small-region locality correlation (e.g., among fatal events inside racks) is relatively strong.
We believe our work will be interesting to large-scale HPC system administrators and vendors and to fault tolerance researchers, enabling them to better understand fatal events and mitigate such events accordingly.
Pages: 361-374
Page count: 14
Related papers
50 in total
  • [31] Exploring a large-scale multi-modal transportation recommendation system
    Liu, Yang
    Lyu, Cheng
    Liu, Zhiyuan
    Cao, Jinde
    TRANSPORTATION RESEARCH PART C-EMERGING TECHNOLOGIES, 2021, 126
  • [32] Exploring large-scale, distributed system behavior with a focus on information assurance
    Helsinger, A
    Ferguson, W
    Lazarus, R
    DISCEX'01: DARPA INFORMATION SURVIVABILITY CONFERENCE & EXPOSITION II, VOL II, PROCEEDINGS, 2001, : 273 - 286
  • [33] Online Event Correlations Analysis in System Logs of Large-Scale Cluster Systems
    Zhou, Wei
    Zhan, Jianfeng
    Meng, Dan
    Zhang, Zhihong
    NETWORK AND PARALLEL COMPUTING, 2010, 6289 : 262 - +
  • [34] SLoG: Large-Scale Logging Middleware for HPC and Big Data Convergence
    Matri, Pierre
    Carns, Philip
    Ross, Robert
    Costan, Alexandru
    Perez, Maria S.
    Antoniu, Gabriel
    2018 IEEE 38TH INTERNATIONAL CONFERENCE ON DISTRIBUTED COMPUTING SYSTEMS (ICDCS), 2018, : 1507 - 1512
  • [35] Large-scale GW calculations on pre-exascale HPC systems
    Del Ben, Mauro
    da Jornada, Felipe H.
    Canning, Andrew
    Wichmann, Nathan
    Raman, Karthik
    Sasanka, Ruchira
    Yang, Chao
    Louie, Steven G.
    Deslippe, Jack
    COMPUTER PHYSICS COMMUNICATIONS, 2019, 235 : 187 - 195
  • [36] Large-scale GW calculations on pre-exascale HPC systems
    Del Ben, Mauro
    da Jornada, Felipe H.
    Canning, Andrew
    Louie, Steven
    Deslippe, Jack
    ABSTRACTS OF PAPERS OF THE AMERICAN CHEMICAL SOCIETY, 2018, 256
  • [37] Enabling Large-Scale Linear Systems of Equations on Hybrid HPC Infrastructures
    Astsatryan, H.
    Sahakyan, V.
    Shoukouryan, Yu
    Dayde, M.
    Hurault, A.
    ICT INNOVATIONS 2011, 2011, 150 : 239 - +
  • [38] Towards Identifying Large-scale BGP Events
    Chen, Meng
    Xu, Mingwei
    Song, Xirui
    Yang, Yuan
    40TH ANNUAL IEEE CONFERENCE ON LOCAL COMPUTER NETWORKS (LCN 2015), 2015, : 165 - 168
  • [39] Infrasonic observations of large-scale HE events
    Whitaker, Rodney W.
    Mutschlecner, J. Paul
    Davidson, Masha B.
    Noel, Susan D.
    NASA Conference Publication, 1990, (3101):
  • [40] The Disclosure Dilemma - Large-Scale Adverse Events
    Dudzinski, Denise M.
    Hebert, Philip C.
    Foglia, Mary Beth
    Gallagher, Thomas H.
    NEW ENGLAND JOURNAL OF MEDICINE, 2010, 363 (10): : 978 - 986