Correlated Set Coordination in Fault Tolerant Message Logging Protocols

Cited by: 0
Authors
Bouteiller, Aurelien [1]
Herault, Thomas [1]
Bosilca, George [1]
Dongarra, Jack J. [1]
Affiliation
[1] Univ Tennessee, Innovat Comp Lab, Knoxville, TN 37996 USA
DOI
Not available
Chinese Library Classification number
TP301 [Theory, Methods];
Discipline classification code
081202;
Abstract
Based on our current expectation for exascale systems, composed of hundreds of thousands of many-core nodes, the mean time between failures will become small, even under the most optimistic assumptions. One of the most scalable checkpoint-restart techniques, the message logging approach, is the most challenged when the number of cores per node increases, due to the high overhead of saving the message payload. Fortunately, for two processes on the same node, the failure probability is correlated, meaning that coordinated recovery is free. In this paper, we propose an intermediate approach that uses coordination between correlated processes, but retains the scalability advantage of message logging between independent ones. The algorithm still belongs to the family of event logging protocols, but eliminates the need for costly payload logging between coordinated processes.
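The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Process` class, the `node` field used to define a correlated set, and the `send` helper are all hypothetical names introduced here. The sketch assumes that every delivery is still recorded as an event, while the payload is copied into the sender's log only when the receiver lives on a different node.

```python
from dataclasses import dataclass, field


@dataclass
class Process:
    rank: int
    node: str  # hypothetical node id; processes sharing it form one correlated set
    payload_log: list = field(default_factory=list)  # sender-side message copies
    event_log: list = field(default_factory=list)    # receiver-side delivery events


def send(sender: Process, receiver: Process, payload: bytes) -> None:
    """Deliver a message, logging the payload only across correlated sets."""
    # Assumption: the delivery event is always logged, since the protocol
    # is described as remaining in the event logging family.
    receiver.event_log.append((sender.rank, len(receiver.event_log)))

    # Assumption: processes on the same node recover together through
    # coordination, so the payload never needs to be replayed between them.
    if sender.node != receiver.node:
        sender.payload_log.append((receiver.rank, payload))


p0 = Process(rank=0, node="n0")
p1 = Process(rank=1, node="n0")  # same node as p0: coordinated
p2 = Process(rank=2, node="n1")  # different node: message logging applies

send(p0, p1, b"intra-node")
send(p0, p2, b"inter-node")
```

In this sketch only the inter-node message ends up in `p0.payload_log`, while both receivers record a delivery event.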
Pages: 51-64
Page count: 14