A Lightweight Causal Message Logging Protocol to Lower Fault Tolerance Overhead

Cited by: 0
Authors
Yang, Jin-Min [1]
Affiliations
[1] Hunan Univ, Dept Comp Sci & Elect Engn, Changsha, Hunan, Peoples R China
Keywords
High performance computing; fault tolerance; rollback recovery; execution model; message logging
DOI
10.1109/CLUSTER.2016.64
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Rollback recovery is a key, dependable approach to fault tolerance in high performance computing and to parallel program debugging. Among rollback recovery protocols, causal message logging has several desirable characteristics, but its high piggybacking overhead hinders its adoption, especially in large-scale distributed systems. This overhead arises from its conservative assumptions about the program execution model. This paper identifies how non-deterministic message delivery influences the correct outcome of a process, and then gives a scheme that relaxes the constraints of the piecewise deterministic execution model. Building on this, a lightweight implementation of causal message logging is proposed to decrease the overhead of piggybacking and rolling forward. Experimental results on three NAS NPB2.3 benchmarks show that the proposed scheme significantly reduces this overhead.
Pages: 392 - 401
Number of pages: 10
Related papers
50 in total
  • [21] Extending the TOKENCMP cache coherence protocol for low overhead fault tolerance in CMP architectures
    Fernandez-Pascual, Ricardo
    Garcia, Jose M.
    Acacio, Manuel E.
    Duato, Jose
    IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2008, 19 (08) : 1044 - 1056
  • [22] On the Use of Cluster-Based Partial Message Logging to Improve Fault Tolerance for MPI HPC Applications
    Ropars, Thomas
    Guermouche, Amina
    Ucar, Bora
    Meneses, Esteban
    Kale, Laxmikant V.
    Cappello, Franck
    EURO-PAR 2011 PARALLEL PROCESSING, PT 1, 2011, 6852 : 567 - 578
  • [23] Lightweight Cooperative Logging for Fault Replication in Concurrent Programs
    Machado, Nuno
    Romano, Paolo
    Rodrigues, Luis
    2012 42ND ANNUAL IEEE/IFIP INTERNATIONAL CONFERENCE ON DEPENDABLE SYSTEMS AND NETWORKS (DSN), 2012,
  • [24] A non-blocking recovery algorithm for causal message logging
    Mitchell, JR
    Garg, VK
    SEVENTEENTH IEEE SYMPOSIUM ON RELIABLE DISTRIBUTED SYSTEMS, PROCEEDINGS, 1998, : 3 - 9
  • [25] Scalable causal message logging for wide-area environments
    Bhatia, K
    Marzullo, K
    Alvisi, L
    CONCURRENCY AND COMPUTATION-PRACTICE & EXPERIENCE, 2003, 15 (10) : 873 - 889
  • [26] Optimization on OLSR protocol for lower routing overhead
    Xue, Yong
    Jiang, Hong
    Hu, Hui
    ROUGH SETS AND KNOWLEDGE TECHNOLOGY, 2008, 5009 : 723 - 730
  • [27] Multimedia checkpoint protocol with lower recovery overhead
    Osada, S
    Higaki, H
    AINA 2003: 17TH INTERNATIONAL CONFERENCE ON ADVANCED INFORMATION NETWORKING AND APPLICATIONS, 2003, : 598 - 601
  • [28] Efficient garbage collection schemes for causal message logging with independent checkpointing
    Ahn, J
    Min, SG
    Hwang, CS
    Yu, HC
    JOURNAL OF SUPERCOMPUTING, 2002, 22 (02) : 175 - 196
  • [29] Efficient Garbage Collection Schemes for Causal Message Logging with Independent Checkpointing
    Jinho Ahn
    Sung-Gi Min
    Chong-Sun Hwang
    Heonchang Yu
    The Journal of Supercomputing, 2002, 22 : 175 - 196
  • [30] Correlated Set Coordination in Fault Tolerant Message Logging Protocols
    Bouteiller, Aurelien
    Herault, Thomas
    Bosilca, George
    Dongarra, Jack J.
    EURO-PAR 2011 PARALLEL PROCESSING, PT 2, 2011, 6853 : 51 - 64