Fault Tolerance Management for a Hierarchical GridRPC Middleware

Cited by: 0

Authors:

Bouteiller, Aurelien [1]

Desprez, Frederic [1]

Affiliations:

[1] UCBL, INRIA, CNRS ENS Lyon, LIP ENS Lyon, UMR 5668, F-69364 Lyon 07, France

Source:

CCGRID 2008: EIGHTH IEEE INTERNATIONAL SYMPOSIUM ON CLUSTER COMPUTING AND THE GRID, VOLS 1 AND 2, PROCEEDINGS | 2008

Keywords:

GridRPC; Fault tolerant; Failure detector; Checkpoint; Distributed algorithm;

DOI:

Not available

Chinese Library Classification (CLC) number:

TP3 [Computing technology, computer technology];

Discipline classification code:

0812 ;

Abstract:

The GridRPC model is well suited to high-performance computing on grids because it efficiently solves most of the issues raised by geographically and administratively split resources. Because of their large scale, long-range networks, and heterogeneity, grids are extremely prone to failures. GridRPC middleware usually manages failures by using 1) a failure detector provided by TCP or another network link layer, 2) automatic checkpoints of sequential jobs, and 3) a centralized stable agent to perform scheduling. Recent developments have provided new mechanisms: the optimal Chandra, Toueg, and Aguilera failure detector; most numerical libraries now providing their own optimized checkpoint routine; and distributed scheduling GridRPC architectures. In this paper we aim to adapt to these novelties by providing the first implementation and evaluation in a grid system of the optimal failure detector; a novel and simple checkpoint API that manages both service-provided checkpoints and automatic checkpoints (even for parallel services); and a scheduling hierarchy recovery algorithm that tolerates several simultaneous failures. All of these mechanisms are implemented in the DIET middleware and evaluated on a real grid.

Citation

Pages: 484-491

Page count: 8

