Toward an In-Depth Analysis of Multifidelity High Performance Computing Systems

Cited by: 2
Authors
Shilpika, Shilpika [1, 2]
Lusch, Bethany [2]
Emani, Murali [2]
Simini, Filippo [2]
Vishwanath, Venkatram [2]
Papka, Michael E. [2, 3]
Ma, Kwan-Liu [1]
Affiliations
[1] Univ Calif Davis, Dept Comp Sci, Davis, CA 95616 USA
[2] Argonne Natl Lab, Argonne Leadership Comp Facil, Argonne, IL 60439 USA
[3] Northern Illinois Univ, Dept Comp Sci, De Kalb, IL 60115 USA
Keywords
Error Log Analysis; HPC; Visualization; Time-series Clustering; Machine Learning; Reliability;
DOI
10.1109/CCGrid54584.2022.00081
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology];
Subject classification code
0812;
Abstract
To maintain a robust and reliable supercomputing facility, monitoring it and understanding its hardware system events and behaviors are essential tasks. Exascale systems will be increasingly heterogeneous, and the volume of system data, collected from multiple subsystems and components and measured at multiple fidelity levels and temporal resolutions, will continue to grow. In this work, we aim to create an effective solution for analyzing diverse and massive datasets gathered from the error logs, job logs, and environment logs of an HPC system, such as a Cray XC40 supercomputer. We build an end-to-end error log analysis system that analyzes the job logs and gleans insights from their correspondence with hardware error logs and environment logs despite their varying temporal and spatial resolutions. The machine learning pipeline built into our system is approximately 92% accurate in predicting the job exit status and does so with sufficient lead time for evasive actions to be taken before the actual failure event occurs.
Pages: 716-725
Number of pages: 10