Toward an In-Depth Analysis of Multifidelity High Performance Computing Systems

Cited by: 2
Authors
Shilpika, Shilpika [1, 2]
Lusch, Bethany [2 ]
Emani, Murali [2 ]
Simini, Filippo [2 ]
Vishwanath, Venkatram [2 ]
Papka, Michael E. [2, 3]
Ma, Kwan-Liu [1 ]
Affiliations
[1] Univ Calif Davis, Dept Comp Sci, Davis, CA 95616 USA
[2] Argonne Natl Lab, Argonne Leadership Comp Facil, Argonne, IL 60439 USA
[3] Northern Illinois Univ, Dept Comp Sci, De Kalb, IL 60115 USA
Keywords
Error Log Analysis; HPC; Visualization; Time-series Clustering; Machine Learning; Reliability
DOI
10.1109/CCGrid54584.2022.00081
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology]
Subject classification code
0812
Abstract
To maintain a robust and reliable supercomputing facility, monitoring it and understanding its hardware system events and behaviors is an essential task. Exascale systems will be increasingly heterogeneous, and the volume of systems data, collected from multiple subsystems and components measured at multiple fidelity levels and temporal resolutions, will continue to grow. In this work, we aim to create an effective solution to analyze diverse and massive datasets gathered from the error logs, job logs, and environment logs of an HPC system, such as a Cray XC40 supercomputer. We build an end-to-end error log analysis system that analyzes the job logs and gleans insights from their correspondence with hardware error logs and environment logs despite their varying temporal and spatial resolutions. The machine learning pipeline built into our system is approximately 92% accurate in predicting the job exit status and does so with sufficient lead time for evasive actions to be taken before the actual failure event occurs.
Pages: 716-725
Page count: 10