Correlated Set Coordination in Fault Tolerant Message Logging Protocols

Cited by: 0
Authors
Bouteiller, Aurelien [1]
Herault, Thomas [1]
Bosilca, George [1]
Dongarra, Jack J. [1]
Affiliations
[1] Univ Tennessee, Innovat Comp Lab, Knoxville, TN 37996 USA
Source
Keywords
DOI
Not available
Chinese Library Classification
TP301 [Theory, Methods];
Discipline classification code
081202;
Abstract
Based on our current expectation for the exascale systems, composed of hundreds of thousands of many-core nodes, the mean time between failures will become small, even under the most optimistic assumptions. One of the most scalable checkpoint restart techniques, the message logging approach, is the most challenged when the number of cores per node increases, due to the high overhead of saving the message payload. Fortunately, for two processes on the same node, the failure probability is correlated, meaning that coordinated recovery is free. In this paper, we propose an intermediate approach that uses coordination between correlated processes, but retains the scalability advantage of message logging between independent ones. The algorithm still belongs to the family of event logging protocols, but eliminates the need for costly payload logging between coordinated processes.
Pages: 51-64
Page count: 14