Fault Tolerance Management for a Hierarchical GridRPC Middleware

Cited by: 0
Authors
Bouteiller, Aurelien [1]
Desprez, Frederic [1]
Affiliations
[1] UCBL, INRIA, CNRS ENS Lyon, LIP ENS Lyon, UMR 5668, F-69364 Lyon 07, France
Keywords
GridRPC; Fault tolerant; Failure detector; Checkpoint; Distributed algorithm;
DOI
Not available
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology];
Discipline classification code
0812;
Abstract
The GridRPC model is well suited for high performance computing on grids because it efficiently solves most of the issues raised by geographically and administratively split resources. Because of their large scale, long-range networks, and heterogeneity, grids are extremely prone to failures. GridRPC middleware usually manages failures by using 1) a failure detector provided by TCP or another network link layer, 2) automatic checkpoints of sequential jobs, and 3) a centralized stable agent to perform scheduling. Recent developments have provided new mechanisms: the optimal Chandra, Toueg, and Aguilera failure detector; numerical libraries, most of which now provide their own optimized checkpoint routines; and distributed scheduling GridRPC architectures. In this paper we aim to adapt to these novelties by providing the first implementation and evaluation in a grid system of the optimal failure detector, a novel and simple checkpoint API that manages both service-provided and automatic checkpoints (even for parallel services), and a scheduling hierarchy recovery algorithm that tolerates several simultaneous failures. All of these mechanisms are implemented in the DIET middleware and evaluated on a real grid.
Pages: 484-491
Page count: 8