A Lightweight Causal Message Logging Protocol to Lower Fault Tolerance Overhead

Authors
Yang, Jin-Min [1]
Affiliations
[1] Hunan Univ, Dept Comp Sci & Elect Engn, Changsha, Hunan, Peoples R China
Keywords
High performance computing; fault tolerance; rollback recovery; execution model; message logging
DOI
10.1109/CLUSTER.2016.64
Chinese Library Classification number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Rollback recovery is a trusted, key approach to fault tolerance in high performance computing and to parallel program debugging. Among the various rollback recovery protocols, causal message logging has some desirable characteristics, but its high piggybacking overhead hinders its adoption, especially in large-scale distributed systems. This overhead arises from its conservative assumption about the program execution model. This paper identifies the influence of non-deterministic message delivery on the correct outcome of a process, and then gives a scheme to relax the constraints imposed by the piecewise deterministic execution model. A lightweight implementation of causal message logging is then proposed to decrease the overhead of piggybacking and rolling forward. Experimental results on three NAS NPB 2.3 benchmarks show that the proposed scheme significantly reduces this overhead.
Pages: 392 - 401
Page count: 10
Related papers (first 10 of 50)
  • [1] The relative overhead of piggybacking in causal message logging protocols
    Bhatia, K
    Marzullo, K
    Alvisi, L
    SEVENTEENTH IEEE SYMPOSIUM ON RELIABLE DISTRIBUTED SYSTEMS, PROCEEDINGS, 1998, : 348 - 353
  • [2] Garbage collection in a causal message logging protocol
    Chung, KS
    Yu, HC
    Park, S
    HIGH PERFORMANCE COMPUTING AND COMMUNICATIONS, PROCEEDINGS, 2005, 3726 : 123 - 132
  • [3] A lightweight message logging scheme for fault tolerant MPI
    Lee, I
    Yeom, HY
    Park, T
    Park, H
    PARALLEL PROCESSING AND APPLIED MATHEMATICS, 2004, 3019 : 397 - 404
  • [4] Reducing the Overhead of Message Logging in Fault-Tolerant HPC Applications
    Meneses, Esteban
    HIGH PERFORMANCE COMPUTING CARLA 2016, 2017, 697 : 204 - 218
  • [5] A causal message logging protocol with asynchronous checkpointing for distributed systems
    Ahn, J
    Kim, K
    Hwang, C
    PARALLEL AND DISTRIBUTED COMPUTING SYSTEMS, 2000, : 523 - 528
  • [6] Efficient causal message logging protocol integrated with asynchronous checkpointing
    Ahn, Jinho
    WSEAS: ADVANCES ON APPLIED COMPUTER AND APPLIED COMPUTATIONAL SCIENCE, 2008, : 300 - 305
  • [7] A causal message logging protocol for mobile nodes in mobile computing systems
    Ahn, JH
    Min, SG
    Hwang, CS
    FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF ESCIENCE, 2004, 20 (04) : 663 - 686
  • [8] Message fragment based causal message logging
    Ci, Yi-Wei
    Zhang, Zhan
    Zuo, De-Cheng
    Yang, Xiao-Zong
    JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING, 2009, 69 (11) : 915 - 921
  • [9] Applying Message Logging to Support Fault-Tolerance of SOA Systems
    Danilecki, Arkadiusz
    Holenko, Mateusz
    Kobusinska, Anna
    Szychowiak, Michal
    Zierhoffer, Piotr
    FOUNDATIONS OF COMPUTING AND DECISION SCIENCES, 2013, 38 (03) : 145 - 158
  • [10] An efficient algorithm for causal message logging
    Lee, B
    Park, T
    Yeom, HY
    Cho, Y
    SEVENTEENTH IEEE SYMPOSIUM ON RELIABLE DISTRIBUTED SYSTEMS, PROCEEDINGS, 1998, : 19 - 25