FNB: Fast Non-Blocking Coordinated Checkpointing Protocol for Distributed Systems

Cited by: 4
Authors
Abdelhafidi, Zohra [1]
Djoudi, Mohamed [1]
Lagraa, Nasreddine [1]
Yagoubi, Mohamed Bachir [1]
Affiliations
[1] Amar Telidji Univ, Comp Sci & Math Lab, Laghouat 03000, Algeria
Keywords
Distributed systems; Fault tolerance; Coordinated checkpointing; Dependency; Popular process; GLOBAL-SNAPSHOT ALGORITHMS; LARGE-SCALE; ROLLBACK-RECOVERY; MODEL; LOGP
DOI
10.1007/s00224-014-9599-8
Chinese Library Classification number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
This paper presents a Fast Non-Blocking (FNB) coordinated checkpointing protocol for distributed systems that aims to minimize the number of requests and mutable checkpoints while reducing checkpointing latency. The protocol relies on two mechanisms. The first is piggybacking dependency information on computation and reply messages, thereby tracking direct, transitive, and hidden dependencies among processes. The second is the use of popular processes: because processes communicate with one another, it is preferable for the checkpointing procedure to be initiated by popular processes, which hold more dependency information. This choice may reduce both the checkpointing latency and the likelihood that checkpointing halts because of a fault. We also present a simulation study that compares our protocol to the CSNB protocol (Cao and Singhal Non-Blocking) and the CSB protocol (Cao and Singhal Blocking).
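The two mechanisms named in the abstract can be illustrated with a minimal sketch. This is not the FNB protocol itself, whose details the abstract does not give. It only shows, under stated assumptions, how dependency sets piggybacked on messages let a receiver learn direct and transitive dependencies, and how an initiator might be picked. The `Process` class, the message layout, and the rule that "popular" means "holds the largest dependency set" are all hypothetical. Reply messages and hidden dependencies are not modeled.

```python
from dataclasses import dataclass, field


@dataclass
class Process:
    """Hypothetical process that records which processes it depends on."""
    pid: int
    deps: set = field(default_factory=set)

    def send(self):
        # Piggyback the sender's id and a copy of its known dependencies.
        return {"sender": self.pid, "deps": set(self.deps)}

    def receive(self, message):
        # Direct dependency: the sender itself.
        self.deps.add(message["sender"])
        # Transitive dependencies: whatever the sender already depended on.
        self.deps |= message["deps"]
        # A process never lists itself as a dependency.
        self.deps.discard(self.pid)


def most_popular(processes):
    # Assumption: the "popular" process is the one with the most
    # dependency information (ties resolved by list order).
    return max(processes, key=lambda p: len(p.deps))


procs = [Process(i) for i in range(4)]
procs[1].receive(procs[0].send())  # P1 depends on P0
procs[2].receive(procs[1].send())  # P2 depends on P1 and, transitively, P0
procs[3].receive(procs[2].send())  # P3 depends on P2, P1 and P0

initiator = most_popular(procs)
print(initiator.pid, sorted(initiator.deps))  # prints: 3 [0, 1, 2]
```

In this toy run, P3 has learned about every other process purely from piggybacked data, so it could address checkpoint requests without first querying its neighbors. That is the intuition behind starting from a process with more dependency information.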
Pages: 397-425
Number of pages: 29
Related papers
50 in total
  • [1] FNB: Fast Non-Blocking Coordinated Checkpointing Protocol for Distributed Systems
    Zohra Abdelhafidi
    Mohamed Djoudi
    Nasreddine Lagraa
    Mohamed Bachir Yagoubi
    Theory of Computing Systems, 2015, 57 : 397 - 425
  • [2] Non-blocking coordinated checkpointing protocol for distributed simulation system
    Liu, Yun-Sheng
    Huang, Jian
    Zha, Ya-Bing
    Xitong Fangzhen Xuebao / Journal of System Simulation, 2007, 19 (01): : 71 - 74
  • [3] A non-blocking Checkpointing algorithm for distributed systems
    Guoliang L.
    Shuyu C.
    Xiaoqin Z.
    International Journal of Digital Content Technology and its Applications, 2011, 5 (07) : 230 - 238
  • [4] A new non-blocking synchronous checkpointing scheme for distributed systems
    Gupta, B
    Rahimi, S
    Naskar, P
    Proceedings of the ISCA 20th International Conference on Computers and Their Applications, 2005, : 26 - 31
  • [5] Design and Modeling of a Non-blocking Checkpointing System
    Sato, Kento
    Mohror, Kathryn
    Moody, Adam
    Gamblin, Todd
    de Supinski, Bronis R.
    Maruyama, Naoya
    Matsuoka, Satoshi
    2012 INTERNATIONAL CONFERENCE FOR HIGH PERFORMANCE COMPUTING, NETWORKING, STORAGE AND ANALYSIS (SC), 2012,
  • [6] Non-blocking atomic commitment in distributed systems: A tutorial based on a generic protocol
    Raynal, M
    COMPUTER SYSTEMS SCIENCE AND ENGINEERING, 2000, 15 (02): : 77 - 86
  • [7] Blocking vs. non-blocking coordinated checkpointing for large-scale fault tolerant MPI protocols
    Buntinas, Darius
    Coti, Camille
    Herault, Thomas
    Lemarinier, Pierre
    Pilard, Laurence
    Rezmerita, Ala
    Rodriguez, Eric
    Cappello, Franck
    FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF ESCIENCE, 2008, 24 (01): : 73 - 84
  • [8] On coordinated checkpointing in distributed systems
    Cao, GH
    Singhal, M
    IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 1998, 9 (12) : 1213 - 1225
  • [9] Using computing checkpoints implement consistent low-cost non-blocking coordinated checkpointing
    Men, C
    Yang, XZ
    PARALLEL AND DISTRIBUTED COMPUTING: APPLICATIONS AND TECHNOLOGIES, PROCEEDINGS, 2004, 3320 : 570 - 576
  • [10] On the impossibility of min-process non-blocking checkpointing and an efficient checkpointing algorithm for mobile computing systems
    Cao, GH
    Singhal, M
    1998 INTERNATIONAL CONFERENCE ON PARALLEL PROCESSING - PROCEEDINGS, 1998, : 37 - 44