A Scalable Runtime Fault Localization Framework for High-Performance Computing Systems

Cited by: 0
Authors
Gao, Jian [1]
Wei, Hongmei [1]
Yu, Kang [1]
Qing, Peng [1]
Affiliations
[1] Jiangnan Institute of Computing Technology, Wuxi 214083, Jiangsu, People's Republic of China
Keywords
High-performance computing; Fault localization; Message-passing; Distributed; Failures
DOI
10.1007/s10766-017-0526-x
Chinese Library Classification number
TP301 [Theory, methods]
Discipline classification code
081202
Abstract
Fault localization has become an increasingly challenging issue in high-performance computing (HPC) systems. Various techniques have been applied to HPC systems; however, as these systems scale out, the effectiveness of existing techniques deteriorates rapidly. In this context, we propose a message-passing based fault localization framework, MPFL, which provides a lightweight distributed service using tree-based fault detection (TFD) and fault analysis (TFA) algorithms. In essence, MPFL serves as a fault localization engine within message-passing libraries by enabling system middleware, such as the job scheduler, to provide information about abnormal conditions. We present the details of the MPFL framework, including the implementation of TFD and TFA, and develop a prototype of the fault localization engine within MVAPICH2. The experimental evaluation, performed on a typical HPC cluster with 10 computing nodes, demonstrates the capability of MPFL and shows that the MPFL service does not affect application performance in practice.
Pages: 749-761
Page count: 13
Related papers
50 in total
  • [31] HIGH-PERFORMANCE COMPUTING SYSTEMS - PRESENT AND FUTURE
    BISIANI, R
    FUTURE GENERATION COMPUTER SYSTEMS, 1994, 10 (2-3) : 241 - 248
  • [33] High-performance computing systems and applications for AI
    Gangman Yi
    Vincenzo Loia
    The Journal of Supercomputing, 2019, 75 : 4248 - 4251
  • [34] Quantum Accelerators for High-Performance Computing Systems
    Britt, Keith A.
    Mohiyaddin, Fahd A.
    Humble, Travis S.
    2017 IEEE INTERNATIONAL CONFERENCE ON REBOOTING COMPUTING (ICRC), 2017, : 198 - 204
  • [35] Parallel Soft Computing Techniques in High-Performance Computing Systems
    Dorronsoro, Bernabe
    Nesmachnow, Sergio
    COMPUTER JOURNAL, 2016, 59 (06): : 775 - 776
  • [36] OPTICAL INTERCONNECTS FOR HIGH-PERFORMANCE COMPUTING SYSTEMS
    Tan, Michael R. T.
    McLaren, Moray
    Jouppi, Norman P.
    IEEE MICRO, 2013, 33 (01) : 14 - 21
  • [37] New advances in high-performance computing systems
    Boeres, Cristina
    Bentes, Cristiana
    Moreno, Edward D.
    CONCURRENCY AND COMPUTATION-PRACTICE & EXPERIENCE, 2019, 31 (18):
  • [38] High-performance computing systems: Status and outlook
    Dongarra, J. J.
    van der Steen, A. J.
    ACTA NUMERICA, 2012, 21 : 379 - 474
  • [39] High-performance computing systems and applications for AI
    Yi, Gangman
    Loia, Vincenzo
    JOURNAL OF SUPERCOMPUTING, 2019, 75 (08): : 4248 - 4251