A Scalable Runtime Fault Localization Framework for High-Performance Computing Systems

Cited by: 0

Authors:

Gao, Jian [1]

Wei, Hongmei [1]

Yu, Kang [1]

Qing, Peng [1]

Affiliations:

[1] Jiangnan Inst Comp Technol, Wuxi 214083, Jiangsu, Peoples R China

Source:

INTERNATIONAL JOURNAL OF PARALLEL PROGRAMMING | 2018 / Vol. 46 / Issue 04

Keywords:

High-performance computing; Fault localization; Message-passing; Distributed; FAILURES;

DOI:

10.1007/s10766-017-0526-x

Chinese Library Classification (CLC) number:

TP301 [Theory, Methods];

Discipline classification code:

081202;

Abstract:

Fault localization has become an increasingly challenging issue in high-performance computing (HPC) systems. Various techniques have been applied to HPC systems; however, as these systems scale out, the effectiveness of existing techniques deteriorates rapidly. In this context, we propose a message-passing-based fault localization framework, MPFL, which provides a lightweight distributed service using tree-based fault detection (TFD) and fault analysis (TFA) algorithms. In essence, MPFL serves as a fault localization engine within message-passing libraries, enabling system middleware components such as the job scheduler to provide information about abnormal events. We present details of the MPFL framework, including the implementation of TFD and TFA. Further, we develop a prototype of the fault localization engine within MVAPICH2. The experimental evaluation, performed on a typical HPC cluster with 10 computing nodes, demonstrates the capability of MPFL and shows that the MPFL service does not affect application performance in practice.

Pages: 749-761

Page count: 13
