Toward an In-Depth Analysis of Multifidelity High Performance Computing Systems

Cited by: 2
Authors
Shilpika, Shilpika [1, 2]
Lusch, Bethany [2]
Emani, Murali [2]
Simini, Filippo [2]
Vishwanath, Venkatram [2]
Papka, Michael E. [2, 3]
Ma, Kwan-Liu [1]
Affiliations
[1] Univ Calif Davis, Dept Comp Sci, Davis, CA 95616 USA
[2] Argonne Natl Lab, Argonne Leadership Comp Facil, Argonne, IL 60439 USA
[3] Northern Illinois Univ, Dept Comp Sci, De Kalb, IL 60115 USA
Keywords
Error Log Analysis; HPC; Visualization; Time-series Clustering; Machine Learning; Reliability
DOI
10.1109/CCGrid54584.2022.00081
Chinese Library Classification
TP3 [Computing technology, computer technology]
Subject classification code
0812
Abstract
To maintain a robust and reliable supercomputing facility, it is essential to monitor it and to understand its hardware system events and behaviors. Exascale systems will be increasingly heterogeneous, and the volume of system data, collected from multiple subsystems and components at multiple fidelity levels and temporal resolutions, will continue to grow. In this work, we aim to create an effective solution for analyzing the diverse and massive datasets gathered from the error logs, job logs, and environment logs of an HPC system such as a Cray XC40 supercomputer. We build an end-to-end error log analysis system that analyzes the job logs and gleans insights from their correspondence with hardware error logs and environment logs, despite their varying temporal and spatial resolutions. The machine learning pipeline built into our system is approximately 92% accurate in predicting job exit status and does so with sufficient lead time for evasive action to be taken before the actual failure event occurs.
Pages: 716 - 725
Page count: 10