A Scalable Runtime Fault Localization Framework for High-Performance Computing Systems

Citations: 0
Authors
Gao, Jian [1]
Wei, Hongmei [1]
Yu, Kang [1]
Qing, Peng [1]
Affiliation
[1] Jiangnan Institute of Computing Technology, Wuxi 214083, Jiangsu, People's Republic of China
Keywords
High-performance computing; Fault localization; Message-passing; Distributed; Failures
DOI
10.1007/s10766-017-0526-x
Chinese Library Classification (CLC) number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
Fault localization has become an increasingly challenging issue in high-performance computing (HPC) systems. Various techniques have been applied to HPC systems; however, as these systems scale out, the effectiveness of the existing techniques deteriorates rapidly. In this context, we propose a message-passing-based fault localization framework, MPFL, which provides a lightweight distributed service using tree-based fault detection (TFD) and fault analysis (TFA) algorithms. In essence, MPFL serves as a fault localization engine within message-passing libraries by enabling system middleware components, such as the job scheduler, to provide information about abnormal events. We present the details of the MPFL framework, including the implementation of TFD and TFA, and develop a prototype of the fault localization engine within MVAPICH2. The experimental evaluation, performed on a typical HPC cluster with 10 computing nodes, demonstrates the capability of MPFL and shows that the MPFL service does not affect application performance in practice.
Pages: 749-761
Page count: 13