An efficient computing-checkpoint based coordinated checkpoint algorithm

Authors
Men Chaoguang [1]
Wang Dongsheng
Zhao Yunlong
Affiliations
[1] Tsinghua Univ, Dept Comp Sci & Technol, Beijing 100084, Peoples R China
[2] Harbin Engn Univ, Res Ctr High Dependabil Comp Technol, Harbin 150001, Heilongjiang, Peoples R China
DOI
Not available
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
In this paper, the concept of a "computing checkpoint" is introduced, and an efficient coordinated checkpoint algorithm is proposed. The algorithm combines two approaches to reducing the overhead of coordinated checkpointing: minimizing the number of processes that take checkpoints, and making the checkpointing process non-blocking. By piggybacking information about which processes have taken a new checkpoint on the broadcast commit message, the checkpoint sequence number of every process is kept consistent across all processes, so that unnecessary checkpoints and orphan messages are avoided in subsequent execution. Evaluation results show that the number of redundant computing checkpoints is less than 1/10 of the number of tentative checkpoints. Analysis and experiments show that the overhead of the algorithm is lower than that of other coordinated checkpoint algorithms.
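The piggybacking idea in the abstract can be illustrated with a minimal sketch. It is not the paper's algorithm: the `Process` class, the `csn` vector, and the `broadcast_commit` helper are hypothetical names introduced here, and the sketch assumes reliable delivery of the commit message to every process. It only shows how carrying the set of newly checkpointed processes in the commit message lets every process hold the same view of all checkpoint sequence numbers.

```python
from dataclasses import dataclass, field


@dataclass
class Process:
    """Hypothetical process that tracks checkpoint sequence numbers (CSNs)."""
    pid: int
    n: int                                    # total number of processes
    csn: list = field(default_factory=list)   # csn[j]: last known CSN of process j

    def __post_init__(self):
        if not self.csn:
            self.csn = [0] * self.n

    def on_commit(self, checkpointed):
        # The commit message piggybacks the set of processes that took a
        # new checkpoint; bump the locally known CSN of each of them.
        for j in checkpointed:
            self.csn[j] += 1


def broadcast_commit(processes, checkpointed):
    """Deliver the commit message to every process (assumed reliable)."""
    for p in processes:
        p.on_commit(checkpointed)


# Four processes; only processes 0 and 2 took a checkpoint in this round.
procs = [Process(pid=i, n=4) for i in range(4)]
broadcast_commit(procs, {0, 2})
views = [p.csn for p in procs]
```

After the broadcast, every entry of `views` is identical, including for processes 1 and 3, which took no checkpoint. Under this assumption, a receiver could compare a sequence number carried on a later message with its own view to decide whether a checkpoint is actually needed.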
Pages: 99-109
Number of pages: 11