FNB: Fast Non-Blocking Coordinated Checkpointing Protocol for Distributed Systems

Cited by: 4
Authors
Abdelhafidi, Zohra [1]
Djoudi, Mohamed [1]
Lagraa, Nasreddine [1]
Yagoubi, Mohamed Bachir [1]
Affiliations
[1] Amar Telidji Univ, Comp Sci & Math Lab, Laghouat 03000, Algeria
Keywords
Distributed systems; Fault tolerance; Coordinated checkpointing; Dependency; Popular process; GLOBAL-SNAPSHOT ALGORITHMS; LARGE-SCALE; ROLLBACK-RECOVERY; MODEL; LOGP;
DOI
10.1007/s00224-014-9599-8
Chinese Library Classification number
TP301 [Theory, Methods];
Discipline classification code
081202;
Abstract
This paper presents a Fast Non-Blocking (FNB) coordinated checkpointing protocol for distributed systems that aims to minimize the number of requests and mutable checkpoints while reducing checkpointing latency. Our protocol relies on two mechanisms. The first is piggybacking dependency information on computation and reply messages, thereby tracking direct, transitive, and hidden dependencies among processes. The second is the use of popular processes: because of the communication between processes, it is preferable that the checkpointing procedure be initiated by popular processes, which hold more dependency information. This may reduce the checkpointing latency and the likelihood that checkpointing halts because of a fault. We also present a simulation study that compares our protocol with the CSNB protocol (Cao and Singhal Non-Blocking) and the CSB protocol (Cao and Singhal Blocking).
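The piggybacking idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the FNB protocol itself: the `Process` class, its method names, and the message layout are hypothetical, and the sketch only shows how attaching a dependency set to each message lets a receiver learn both direct and transitive dependencies. Hidden dependencies, reply messages, and popular-process selection are not modeled.

```python
from dataclasses import dataclass, field


@dataclass
class Process:
    """Hypothetical process that tracks dependencies since its last checkpoint."""
    pid: int
    deps: set = field(default_factory=set)

    def send(self):
        # Piggyback a snapshot of the current dependency set on the message.
        return {"sender": self.pid, "deps": frozenset(self.deps)}

    def receive(self, msg):
        # Direct dependency on the sender, plus the transitive
        # dependencies the sender carried along.
        self.deps.add(msg["sender"])
        self.deps |= msg["deps"]
        self.deps.discard(self.pid)  # a process never depends on itself

    def checkpoint(self):
        # Assumption: dependency tracking restarts after a checkpoint.
        self.deps.clear()


p0, p1, p2 = Process(0), Process(1), Process(2)
p1.receive(p0.send())  # p1 now depends directly on p0
p2.receive(p1.send())  # p2 depends on p1 directly and on p0 transitively
print(sorted(p2.deps))
```

In this toy run, p2 never receives a message from p0, yet it learns of the dependency through the set piggybacked by p1, which is the kind of information an initiator would need in order to send checkpoint requests only to the processes it actually depends on.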
Pages: 397-425
Number of pages: 29