A Scalable Runtime Fault Localization Framework for High-Performance Computing Systems

Cited by: 0
Authors
Gao, Jian [1]
Wei, Hongmei [1]
Yu, Kang [1]
Qing, Peng [1]
Affiliations
[1] Jiangnan Inst Comp Technol, Wuxi 214083, Jiangsu, Peoples R China
Keywords
High-performance computing; Fault localization; Message-passing; Distributed; Failures
DOI
10.1007/s10766-017-0526-x
Chinese Library Classification (CLC) number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
Fault localization has become an increasingly challenging issue in high-performance computing (HPC) systems. Various techniques have been applied to HPC systems; however, as these systems scale out, the effectiveness of existing techniques deteriorates rapidly. In this context, we propose a message-passing-based fault localization framework, MPFL, which provides a lightweight distributed service using tree-based fault detection (TFD) and fault analysis (TFA) algorithms. In essence, MPFL serves as a fault localization engine within message-passing libraries by enabling system middleware, such as the job scheduler, to provide information about abnormal conditions. We present details of the MPFL framework, including the implementation of TFD and TFA. Further, we develop a prototype of the fault localization engine within MVAPICH2. The experimental evaluation, performed on a typical HPC cluster with 10 computing nodes, demonstrates the capability of MPFL and shows that the MPFL service does not affect application performance in practice.
Pages: 749-761
Number of pages: 13