Toward an In-Depth Analysis of Multifidelity High Performance Computing Systems

Cited by: 2
Authors
Shilpika, Shilpika [1, 2]
Lusch, Bethany [2]
Emani, Murali [2]
Simini, Filippo [2]
Vishwanath, Venkatram [2]
Papka, Michael E. [2, 3]
Ma, Kwan-Liu [1]
Affiliations
[1] Univ Calif Davis, Dept Comp Sci, Davis, CA 95616 USA
[2] Argonne Natl Lab, Argonne Leadership Comp Facil, Argonne, IL 60439 USA
[3] Northern Illinois Univ, Dept Comp Sci, De Kalb, IL 60115 USA
Keywords
Error Log Analysis; HPC; Visualization; Time-series Clustering; Machine Learning; Reliability;
DOI
10.1109/CCGrid54584.2022.00081
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology];
Discipline classification code
0812;
Abstract
To maintain a robust and reliable supercomputing facility, monitoring it and understanding its hardware system events and behaviors is an essential task. Exascale systems will be increasingly heterogeneous, and the volume of systems data, collected from multiple subsystems and components measured at multiple fidelity levels and temporal resolutions, will continue to grow. In this work, we aim to create an effective solution to analyze diverse and massive datasets gathered from the error logs, job logs, and environment logs of an HPC system, such as a Cray XC40 supercomputer. We build an end-to-end error log analysis system that analyzes the job logs and gleans insights from their correspondence with hardware error logs and environment logs despite their varying temporal and spatial resolutions. The machine learning pipeline built into our system is approximately 92% accurate in predicting the job exit status and does so with sufficient lead time for evasive actions to be taken before the actual failure event occurs.
Pages: 716-725
Page count: 10