Fault Tolerance Management for a Hierarchical GridRPC Middleware

Authors
Bouteiller, Aurelien [1]
Desprez, Frederic [1]
Affiliations
[1] UCBL, INRIA, CNRS ENS Lyon, LIP ENS Lyon,UMR 5668, F-69364 Lyon 07, France
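Keywords
GridRPC; Fault tolerant; Failure detector; Checkpoint; Distributed algorithm
DOI
Not available
Chinese Library Classification number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract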
The GridRPC model is well suited to high-performance computing on grids because it efficiently addresses most of the issues raised by geographically and administratively distributed resources. Because of their large scale, long-range networks, and heterogeneity, grids are extremely prone to failures. GridRPC middleware usually manages failures with (1) a failure detector provided by TCP or another network link layer, (2) automatic checkpoints of sequential jobs, and (3) a centralized stable agent that performs scheduling. Recent developments have introduced new mechanisms: the optimal Chandra, Toueg, and Aguilera failure detector; optimized checkpoint routines that most numerical libraries now provide themselves; and distributed-scheduling GridRPC architectures. In this paper we adapt to these developments by providing the first implementation and evaluation in a grid system of the optimal failure detector, a novel and simple checkpoint API that handles both service-provided and automatic checkpoints (even for parallel services), and a recovery algorithm for the scheduling hierarchy that tolerates several simultaneous failures. All of these mechanisms are implemented in the DIET middleware and evaluated on a real grid.
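As a rough illustration of the kind of heartbeat-based failure detection the abstract refers to, the sketch below estimates when the next heartbeat should arrive from a window of past arrivals and suspects the peer once that estimate plus a safety margin has passed. It is a simplified sketch under stated assumptions, not the paper's implementation and not the DIET API: the class name, the method names, and the `interval`, `margin`, and `window` parameters are all hypothetical.

```python
from collections import deque


class HeartbeatFailureDetector:
    """Hypothetical sketch: suspect a peer when its next heartbeat is overdue."""

    def __init__(self, interval, margin, window=10):
        self.interval = interval  # assumed heartbeat sending period, in seconds
        self.margin = margin      # assumed safety margin added to the estimate
        self.samples = deque(maxlen=window)  # recent (sequence number, arrival time)
        self.last_seq = -1
        self.deadline = None      # time after which the peer becomes suspected

    def on_heartbeat(self, seq, arrival_time):
        """Record a heartbeat and recompute the deadline for the next one."""
        if seq <= self.last_seq:
            return  # ignore duplicated or reordered heartbeats
        self.last_seq = seq
        self.samples.append((seq, arrival_time))
        # Shift every recorded arrival back to sequence number 0 and average,
        # which smooths out network jitter over the window.
        base = sum(t - s * self.interval for s, t in self.samples) / len(self.samples)
        # Project the averaged arrival forward to the next sequence number.
        expected_next = base + (seq + 1) * self.interval
        self.deadline = expected_next + self.margin

    def is_suspected(self, now):
        """Return True if the next heartbeat is overdue at time `now`."""
        return self.deadline is not None and now > self.deadline
```

A monitoring process would call `on_heartbeat` for each message received from a peer and poll `is_suspected` with the current time. A larger `margin` lowers the rate of false suspicions at the cost of slower detection.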
Pages: 484-491
Number of pages: 8