A Lightweight Causal Message Logging Protocol to Lower Fault Tolerance Overhead

Authors
Yang, Jin-Min [1]
Affiliation
[1] Hunan Univ, Dept Comp Sci & Elect Engn, Changsha, Hunan, Peoples R China
Keywords
High performance computing; fault tolerance; rollback recovery; execution model; message logging
DOI
10.1109/CLUSTER.2016.64
Chinese Library Classification number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Rollback recovery is a key, dependable approach to fault tolerance in high performance computing and to parallel program debugging. Among rollback recovery protocols, causal message logging has several desirable characteristics, but its high piggybacking overhead hinders its adoption, especially in large-scale distributed systems. This overhead stems from the protocol's conservative assumptions about the program execution model. This paper identifies how non-deterministic message delivery affects the correctness of a process's outcome, and then gives a scheme that relaxes the constraints imposed by the piecewise deterministic execution model. Building on this, it proposes a lightweight implementation of causal message logging that reduces the overhead of piggybacking and rolling forward. Experimental results on three NAS NPB 2.3 benchmarks show that the proposed scheme significantly reduces this overhead.
Pages: 392-401
Page count: 10
Related papers
50 in total
  • [41] A low overhead fault tolerant coherence protocol for CMP architectures
    Fernandez-Pascual, Ricardo
    Garcia, Jose M.
    Acacio, Manuel E.
    Duato, Jose
    THIRTEENTH INTERNATIONAL SYMPOSIUM ON HIGH-PERFORMANCE COMPUTER ARCHITECTURE, PROCEEDINGS, 2007, : 157 - +
  • [42] Lightweight Fault Tolerance in Pregel-Like Systems
    Yan, Da
    Cheng, James
    Chen, Hongzhi
    Long, Cheng
    Bangalore, Purushotham
    PROCEEDINGS OF THE 48TH INTERNATIONAL CONFERENCE ON PARALLEL PROCESSING (ICPP 2019), 2019,
  • [43] A Secure and Lightweight Protocol for Message Authentication in Wireless Sensor Networks
    Kar, Jayaprakash
    Naik, Kshirasagar
    Abdelkader, Tamer
    IEEE SYSTEMS JOURNAL, 2021, 15 (03): : 3808 - 3819
  • [44] Enhanced Sender-Based Message Logging for Reducing Forced Checkpointing Overhead in Distributed Systems
    Ahn, Jinho
    IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS, 2021, E104D (09) : 1500 - 1505
  • [45] Lightweight Consistent Recovery Algorithm for Sender-Based Message Logging in Distributed Systems
    Ahn, Jinho
    IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS, 2011, E94D (08) : 1712 - 1715
  • [46] On combining fault tolerance and partial replication with causal consistency
    van der Linde, Albert
    Serra, Diogo
    Leitao, Joao
    Preguica, Nuno
    7TH WORKSHOP ON PRINCIPLES AND PRACTICE OF CONSISTENCY FOR DISTRIBUTED DATA (PAPOC '20), 2020,
  • [47] An efficient centralized algorithm ensuring consistent recovery in causal message logging with independent checkpointing
    Ahn, J
    Min, S
    IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS, 2004, E87D (04): : 1039 - 1043
  • [48] Dynamic Load Balance for Optimized Message Logging in Fault Tolerant HPC Applications
    Meneses, Esteban
    Kale, Laxmikant V.
    Bronevetsky, Greg
    2011 IEEE INTERNATIONAL CONFERENCE ON CLUSTER COMPUTING (CLUSTER), 2011, : 281 - 289
  • [49] Improved message logging versus Improved coordinated checkpointing for fault tolerant MPI
    Lemarinier, P
    Bouteiller, A
    Herault, T
    Krawezik, G
    Cappello, F
    2004 IEEE INTERNATIONAL CONFERENCE ON CLUSTER COMPUTING, 2004, : 115 - 124
  • [50] A fault tolerance protocol for uploads: Design and evaluation
    Cheung, L
    Chou, CF
    Golubchik, L
    Yang, Y
    PARALLEL AND DISTRIBUTED PROCESSING AND APPLICATIONS, PROCEEDINGS, 2004, 3358 : 136 - 145