Fault Tolerance Management for a Hierarchical GridRPC Middleware

Cited by: 0

Authors:

Bouteiller, Aurelien [1]

Desprez, Frederic [1]

Affiliations:

[1] UCBL, INRIA, CNRS ENS Lyon, LIP ENS Lyon, UMR 5668, F-69364 Lyon 07, France

Source:

CCGRID 2008: EIGHTH IEEE INTERNATIONAL SYMPOSIUM ON CLUSTER COMPUTING AND THE GRID, VOLS 1 AND 2, PROCEEDINGS | 2008

Keywords:

GridRPC; Fault tolerant; Failure detector; Checkpoint; Distributed algorithm

DOI:

Not available

Chinese Library Classification (CLC) number:

TP3 [Computing technology, computer technology]

Discipline classification code:

0812

Abstract:

The GridRPC model is well suited for high-performance computing on grids because it efficiently solves most of the issues raised by geographically and administratively split resources. Because of their large scale, long-range networks, and heterogeneity, grids are extremely prone to failures. GridRPC middleware usually manages failures by using 1) a failure detector provided by TCP or another network link layer, 2) automatic checkpoints of sequential jobs, and 3) a centralized stable agent to perform scheduling. Recent developments have provided new mechanisms: the optimal Chandra, Toueg, and Aguilera failure detector; numerical libraries that now mostly provide their own optimized checkpoint routines; and distributed scheduling GridRPC architectures. In this paper we aim to adapt to these novelties by providing the first implementation and evaluation in a grid system of the optimal failure detector, a novel and simple checkpoint API that manages both service-provided and automatic checkpoints (even for parallel services), and a scheduling hierarchy recovery algorithm that tolerates several simultaneous failures. All these mechanisms are implemented in the DIET middleware and evaluated on a real grid.

Pages: 484-491

Page count: 8
