An Explainable Model for Fault Detection in HPC Systems

Cited by: 2
Authors
Molan, Martin [1]
Borghesi, Andrea [1]
Beneventi, Francesco [1]
Guarrasi, Massimiliano [2]
Bartolini, Andrea [1]
Affiliations
[1] Univ Bologna, Bologna, Italy
[2] CINECA, Casalecchio di Reno, Italy
Keywords
Machine learning; High performance computing; Fault detection
DOI
10.1007/978-3-030-90539-2_25
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Large supercomputers are composed of numerous components that risk breaking down or behaving in unwanted ways. Identifying broken components is a daunting task for system administrators, so an automated tool would be a boon for the system's resiliency. The wealth of data available in a supercomputer can be used for this task. In this work we propose an approach that combines holistic data center monitoring, node status labels provided by system administrators, and an explainable model for fault detection in supercomputing nodes. The proposed model aims to classify the different states of the computing nodes using labeled data that describes the supercomputer's behavior; this data is typically collected by system administrators but not integrated into the holistic monitoring infrastructure used for data center automation. Compared with other methods, the one proposed here is robust and provides explainable predictions. The model has been trained and validated on data gathered from a tier-0 supercomputer in production.
Pages: 378-391
Page count: 14