Time Machine: Generative Real-Time Model For Failure (and Lead Time) Prediction in HPC Systems

Cited by: 0
Authors
Alharthi, Khalid Ayed [1, 5, 6]
Jhumka, Arshad [1]
Di, Sheng [2]
Gui, Lin [3]
Cappello, Franck [2, 7]
McIntosh-Smith, Simon [4]
Affiliations
[1] Univ Warwick, Coventry, England
[2] Univ Chicago, Argonne Natl Lab, Chicago, IL USA
[3] Kings Coll London, London, England
[4] Univ Bristol, Bristol, England
[5] Univ Bisha, Bisha, Saudi Arabia
[6] Alan Turing Inst, London, England
[7] Univ Illinois, Champaign, IL USA
Keywords
RESOURCE USE; LARGE-SCALE; CLOUD
DOI
10.1109/DSN58367.2023.00054
Chinese Library Classification number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
High Performance Computing (HPC) systems generate large volumes of unstructured, alphanumeric log messages that capture the health state of their components. Due to their design complexity, HPC systems often suffer failures that halt the execution of applications (e.g., weather prediction, aerodynamics simulation). However, existing failure prediction methods, which typically seek to extract information-theoretic features, fail to scale in terms of both accuracy and prediction speed, limiting their adoption in real-time production systems. In this paper, unlike existing work and inspired by current transformer-based neural networks, which have revolutionized sequential learning in natural language processing (NLP) tasks, we propose a novel, scalable, log-based, self-supervised model (i.e., no need for manual labels), called Time Machine, that predicts (i) forthcoming log events, (ii) the upcoming failure and its location, and (iii) the expected lead time to failure. Time Machine is designed by combining two stacks of transformer decoders, each employing the self-attention mechanism. The first stack addresses the failure location by predicting the sequence of log events and then identifying whether a failure event is part of that sequence. The second stack addresses the lead time to the predicted failure. We evaluate Time Machine on four real-world HPC log datasets and compare it against three state-of-the-art failure prediction approaches. Results show that Time Machine significantly outperforms the related works on BLEU, ROUGE, MCC, and F1-score in predicting forthcoming events, failure location, and failure lead time, while also achieving higher prediction speed.
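As a rough illustration of two ideas named in the abstract, the sketch below shows (a) checking whether a failure event appears in a predicted sequence of log events and (b) computing F1-score and MCC from confusion counts. It is not the authors' implementation: the event identifiers, the `find_failure` helper, and the set of failure events are hypothetical, and the predicted sequence is hard-coded here rather than produced by a transformer decoder.

```python
import math
from typing import Optional, Sequence, Set, Tuple


def find_failure(
    predicted_events: Sequence[str], failure_events: Set[str]
) -> Optional[Tuple[int, str]]:
    """Return (index, event) of the first failure event in the predicted
    sequence, or None if the sequence contains no failure event.

    Assumption: events are represented as string identifiers; the index is
    only the position within the predicted sequence, not a node location.
    """
    for index, event in enumerate(predicted_events):
        if event in failure_events:
            return index, event
    return None


def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient; defined as 0.0 when the denominator is 0."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical predicted sequence and failure vocabulary, for illustration only.
predicted = ["E1", "E7", "E3"]
failures = {"E7"}
hit = find_failure(predicted, failures)
```

With these made-up inputs, `hit` holds the position and identifier of the first predicted failure event. MCC is often reported alongside F1 for failure prediction because it also accounts for true negatives, which matters when failures are rare.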
Pages: 508-521
Page count: 14