A Scalable Runtime Fault Localization Framework for High-Performance Computing Systems

Cited by: 0
Authors
Gao, Jian [1]
Wei, Hongmei [1]
Yu, Kang [1]
Qing, Peng [1]
Affiliations
[1] Jiangnan Institute of Computing Technology, Wuxi 214083, Jiangsu, People's Republic of China
Keywords
High-performance computing; Fault localization; Message-passing; Distributed; Failures
DOI
10.1007/s10766-017-0526-x
Chinese Library Classification
TP301 [Theory, methods]
Discipline classification code
081202
Abstract
Fault localization has become an increasingly challenging issue in high-performance computing (HPC) systems. Various fault localization techniques have been applied to HPC systems. However, as HPC systems scale out, the effectiveness of existing techniques deteriorates rapidly. In this context, we propose a message-passing based fault localization framework, namely MPFL, which provides a lightweight distributed service using tree-based fault detection (TFD) and tree-based fault analysis (TFA) algorithms. In essence, MPFL serves as a fault localization engine within message-passing libraries, enabling system middleware such as the job scheduler to provide information about abnormal conditions. We present details of the MPFL framework, including the implementation of TFD and TFA. Further, we develop a prototype of the fault localization engine within MVAPICH2. The experimental evaluation is performed on a typical HPC cluster with 10 computing nodes. The results demonstrate the capability of MPFL and show that the MPFL service does not affect the performance of an application in practice.
Pages: 749 - 761
Page count: 13
Related papers
50 in total
  • [1] A Scalable Runtime Fault Localization Framework for High-Performance Computing Systems
    Jian Gao
    Hongmei Wei
    Kang Yu
    Peng Qing
    International Journal of Parallel Programming, 2018, 46 : 749 - 761
  • [2] A Scalable Runtime Fault Detection Mechanism for High Performance Computing
    Gao, Jian
    Yu, Kang
    Qing, Peng
    PROCEEDINGS OF 2017 IEEE 2ND INFORMATION TECHNOLOGY, NETWORKING, ELECTRONIC AND AUTOMATION CONTROL CONFERENCE (ITNEC), 2017, : 490 - 495
  • [3] Scalable I/O Forwarding Framework for High-Performance Computing Systems
    Ali, Nawab
    Carns, Philip
    Iskra, Kamil
    Kimpe, Dries
    Lang, Samuel
    Latham, Robert
    Ross, Robert
    Ward, Lee
    Sadayappan, P.
    2009 IEEE INTERNATIONAL CONFERENCE ON CLUSTER COMPUTING AND WORKSHOPS, 2009, : 86 - +
  • [4] Fault-Aware Runtime Strategies for High-Performance Computing
    Li, Yawei
    Lan, Zhiling
    Gujrati, Prashasta
    Sun, Xian-He
    IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2009, 20 (04) : 460 - 473
  • [5] IKAROS: A scalable I/O framework for high-performance computing systems
    Filippidis, Christos
    Tsanakas, Panayiotis
    Cotronis, Yiannis
    JOURNAL OF SYSTEMS AND SOFTWARE, 2016, 118 : 277 - 287
  • [6] Scalable Approach to Failure Analysis of High-Performance Computing Systems
    Shawky, Doaa
    ETRI JOURNAL, 2014, 36 (06) : 1023 - 1031
  • [7] ALiCE: A scalable runtime infrastructure for high performance grid computing
    Teo, YM
    Wang, XB
    NETWORK AND PARALLEL COMPUTING, PROCEEDINGS, 2004, 3222 : 101 - 109
  • [8] A scalable framework for online power modelling of high-performance computing nodes in production
    Pittino, Federico
    Beneventi, Francesco
    Bartolini, Andrea
    Benini, Luca
    PROCEEDINGS 2018 INTERNATIONAL CONFERENCE ON HIGH PERFORMANCE COMPUTING & SIMULATION (HPCS), 2018, : 300 - 307
  • [9] An Extended IMS Framework With a High-Performance and Scalable Distributed Storage and Computing System
    Seraoui, Youssef
    Raouyane, Brahim
    Bellafkih, Mostafa
    2017 INTERNATIONAL SYMPOSIUM ON NETWORKS, COMPUTERS AND COMMUNICATIONS (ISNCC), 2017,
  • [10] Scalable Embedded Systems: Towards the Convergence of High-Performance and Embedded Computing
    Giorgi, Roberto
    PROCEEDINGS IEEE/IFIP 13TH INTERNATIONAL CONFERENCE ON EMBEDDED AND UBIQUITOUS COMPUTING 2015, 2015, : 148 - 153