Fault Tolerance Management for a Hierarchical GridRPC Middleware

Authors
Bouteiller, Aurelien [1]
Desprez, Frederic [1]
Affiliations
[1] UCBL, INRIA, CNRS ENS Lyon, LIP ENS Lyon, UMR 5668, F-69364 Lyon 07, France
Keywords
GridRPC; Fault tolerant; Failure detector; Checkpoint; Distributed algorithm;
DOI
Not available
Chinese Library Classification number
TP3 [Computing technology, computer technology];
Discipline classification code
0812;
Abstract
The GridRPC model is well suited to high-performance computing on grids because it efficiently addresses most of the issues raised by geographically and administratively separated resources. Because of their large scale, long-range networks, and heterogeneity, grids are extremely prone to failures. GridRPC middleware usually manages failures by using 1) a failure detector provided by TCP or another network link layer, 2) automatic checkpoints of sequential jobs, and 3) a centralized stable agent that performs scheduling. Recent developments have provided new mechanisms: the optimal failure detector of Chandra, Toueg, and Aguilera; optimized checkpoint routines that most numerical libraries now supply themselves; and distributed scheduling GridRPC architectures. In this paper we adapt to these developments by providing the first implementation and evaluation in a grid system of the optimal failure detector, a novel and simple checkpoint API that handles both service-provided and automatic checkpoints (even for parallel services), and a recovery algorithm for the scheduling hierarchy that tolerates several simultaneous failures. All of these mechanisms are implemented in the DIET middleware and evaluated on a real grid.
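For readers unfamiliar with heartbeat-based failure detection, the general idea can be sketched as follows. This is only an illustrative simplification, not the detector implemented in DIET or described in the paper: the class name `HeartbeatDetector`, its parameters (`interval`, `margin`, `window`), and the estimation rule are assumptions made for this example. The monitored process is assumed to send numbered heartbeats at a fixed period; the monitor estimates when the next one should arrive from a window of recent arrivals and suspects the process once that estimate plus a safety margin has passed.

```python
from collections import deque


class HeartbeatDetector:
    """Hypothetical monitor that suspects a peer when a heartbeat is overdue."""

    def __init__(self, interval, margin, window=10):
        self.interval = interval  # assumed fixed sending period, in seconds
        self.margin = margin      # assumed safety margin added to the estimate
        self.samples = deque(maxlen=window)  # recent (sequence number, arrival time)
        self.deadline = None      # no deadline until the first heartbeat arrives

    def heartbeat(self, seq, arrival):
        """Record heartbeat number `seq` received at time `arrival`."""
        self.samples.append((seq, arrival))
        # Shift each arrival back to sequence number 0 and average the results,
        # which smooths out network jitter over the window.
        base = sum(t - s * self.interval for s, t in self.samples) / len(self.samples)
        # Project the averaged base forward to the next expected heartbeat.
        expected_next = base + (seq + 1) * self.interval
        self.deadline = expected_next + self.margin

    def suspects(self, now):
        """Return True if the peer is suspected to have failed at time `now`."""
        return self.deadline is not None and now > self.deadline


detector = HeartbeatDetector(interval=1.0, margin=0.5)
for seq in range(5):
    detector.heartbeat(seq, seq * 1.0 + 0.1)  # each heartbeat delayed by 0.1 s
print(detector.suspects(5.5), detector.suspects(5.7))
```

A larger `margin` reduces false suspicions at the cost of slower detection; that trade-off is the main tuning decision in a detector of this kind.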
Pages: 484-491
Number of pages: 8