Correlated Set Coordination in Fault Tolerant Message Logging Protocols

Cited by: 0
Authors
Bouteiller, Aurelien [1]
Herault, Thomas [1]
Bosilca, George [1]
Dongarra, Jack J. [1]
Affiliations
[1] Univ Tennessee, Innovat Comp Lab, Knoxville, TN 37996 USA
DOI
Not available
Chinese Library Classification
TP301 [Theory, Methods];
Discipline classification code
081202;
Abstract
Based on current expectations for exascale systems, composed of hundreds of thousands of many-core nodes, the mean time between failures will become small, even under the most optimistic assumptions. Message logging, one of the most scalable checkpoint-restart techniques, is the most challenged when the number of cores per node increases, because of the high overhead of saving the message payload. Fortunately, for two processes on the same node, the failure probability is correlated, meaning that coordinated recovery is free. In this paper, we propose an intermediate approach that uses coordination between correlated processes but retains the scalability advantage of message logging between independent ones. The algorithm still belongs to the family of event logging protocols but eliminates the need for costly payload logging between coordinated processes.
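The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that a correlated set is simply the set of processes sharing a node, that delivery events are recorded at the receiver, and that payloads are kept at the sender. The names `Process`, `correlated`, and `deliver` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Process:
    rank: int
    node: str  # hypothetical node name; processes on one node form a correlated set
    event_log: list = field(default_factory=list)    # delivery events, no payload
    payload_log: list = field(default_factory=list)  # payloads kept for replay


def correlated(a: Process, b: Process) -> bool:
    # Assumption: failure correlation is approximated by node co-location.
    return a.node == b.node


def deliver(sender: Process, receiver: Process, payload: bytes) -> None:
    # The delivery event is always recorded, so the scheme stays an
    # event logging protocol.
    receiver.event_log.append((sender.rank, len(receiver.event_log)))
    # The payload is kept only when the message crosses correlated sets;
    # processes in the same set are assumed to recover together.
    if not correlated(sender, receiver):
        sender.payload_log.append((receiver.rank, payload))


p0 = Process(0, "node-a")
p1 = Process(1, "node-a")
p2 = Process(2, "node-b")
deliver(p0, p1, b"intra-node")  # same node: event only
deliver(p0, p2, b"inter-node")  # different nodes: event and payload
```

In this sketch only the inter-node message leaves a payload in the sender's log, while both deliveries leave an event at their receiver.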
Pages: 51-64
Page count: 14
Related papers
50 in total
  • [21] Fault-tolerant simulation of population protocols
    Di Luna, Giuseppe A.
    Flocchini, Paola
    Izumi, Taisuke
    Izumi, Tomoko
    Santoro, Nicola
    Viglietta, Giovanni
    Distributed Computing, 2020, 33 : 561 - 578
  • [22] Verification of Fault-Tolerant Protocols with Sally
    Dutertre, Bruno
    Jovanovic, Dejan
    Navas, Jorge A.
    NASA FORMAL METHODS, NFM 2018, 2018, 10811 : 113 - 120
  • [23] Fault-tolerant simulation of population protocols
    Di Luna, Giuseppe A.
    Flocchini, Paola
    Izumi, Taisuke
    Izumi, Tomoko
    Santoro, Nicola
    Viglietta, Giovanni
    DISTRIBUTED COMPUTING, 2020, 33 (06) : 561 - 578
  • [24] A Sequentialization Procedure for Fault-Tolerant Protocols
    Dragoi, Cezara
    Pronesti, Patricio Inzaghi
    VERIFIED SOFTWARE. THEORIES, TOOLS AND EXPERIMENTS, VSTTE 2022, 2023, 13800 : 52 - 71
  • [25] FAULT-TOLERANT DECENTRALIZED COMMIT PROTOCOLS
    YUAN, SM
    AGRAWALA, AK
    JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING, 1991, 13 (03) : 299 - 311
  • [26] An ACL for specifying fault-tolerant protocols
    Dragoni, Nicola
    Gaspari, Mauro
    Guidi, Davide
    APPLIED ARTIFICIAL INTELLIGENCE, 2007, 21 (4-5) : 361 - 381
  • [27] Temporal Verification of Fault-Tolerant Protocols
    Fisher, Michael
    Konev, Boris
    Lisitsa, Alexei
    METHODS, MODELS AND TOOLS FOR FAULT TOLERANCE, 2009, 5454 : 44 - 56
  • [28] Set-to-set fault tolerant routing in hypercubes
    Gu, QP
    Okawa, S
    Peng, ST
    IEICE TRANSACTIONS ON FUNDAMENTALS OF ELECTRONICS COMMUNICATIONS AND COMPUTER SCIENCES, 1996, E79A (04) : 483 - 488
  • [29] Fault-tolerant message routing for multiprocessors
    Zakrevski, L
    Karpovsky, M
    PARALLEL AND DISTRIBUTED PROCESSING, 1998, 1388 : 714 - 730
  • [30] CORRELATED FAILURES IN FAULT-TOLERANT COMPUTERS
    HECHT, H
    DUSSAULT, H
    IEEE TRANSACTIONS ON RELIABILITY, 1987, 36 (02) : 171 - 175