Toward an In-Depth Analysis of Multifidelity High Performance Computing Systems

Cited by: 2
Authors
Shilpika, Shilpika [1, 2]
Lusch, Bethany [2 ]
Emani, Murali [2 ]
Simini, Filippo [2 ]
Vishwanath, Venkatram [2 ]
Papka, Michael E. [2, 3]
Ma, Kwan-Liu [1 ]
Affiliations
[1] Univ Calif Davis, Dept Comp Sci, Davis, CA 95616 USA
[2] Argonne Natl Lab, Argonne Leadership Comp Facil, Argonne, IL 60439 USA
[3] Northern Illinois Univ, Dept Comp Sci, De Kalb, IL 60115 USA
Keywords
Error Log Analysis; HPC; Visualization; Time-series Clustering; Machine Learning; Reliability
DOI
10.1109/CCGrid54584.2022.00081
Chinese Library Classification number
TP3 [Computing Technology, Computer Technology]
Discipline classification code
0812
Abstract
Maintaining a robust and reliable supercomputing facility requires monitoring it and understanding its hardware system events and behaviors. Exascale systems will be increasingly heterogeneous, and the volume of systems data, collected from multiple subsystems and components measured at multiple fidelity levels and temporal resolutions, will continue to grow. In this work, we aim to create an effective solution to analyze diverse and massive datasets gathered from the error logs, job logs, and environment logs of an HPC system, such as a Cray XC40 supercomputer. We build an end-to-end error log analysis system that analyzes the job logs and gleans insights from their correspondence with hardware error logs and environment logs despite their varying temporal and spatial resolutions. The machine learning pipeline in our system is approximately 92% accurate in predicting job exit status and does so with sufficient lead time for evasive actions to be taken before the actual failure event occurs.
Pages: 716-725
Page count: 10