Correlated Set Coordination in Fault Tolerant Message Logging Protocols

Times cited: 0
Authors
Bouteiller, Aurelien [1]
Herault, Thomas [1]
Bosilca, George [1]
Dongarra, Jack J. [1]
Affiliations
[1] Univ Tennessee, Innovat Comp Lab, Knoxville, TN 37996 USA
Source
Keywords
DOI
Not available
Chinese Library Classification number
TP301 [Theory, Methods];
Discipline classification code
081202;
Abstract
Based on our current expectations for exascale systems, composed of hundreds of thousands of many-core nodes, the mean time between failures will become small, even under the most optimistic assumptions. Message logging, one of the most scalable checkpoint-restart techniques, is the most challenged when the number of cores per node increases, due to the high overhead of saving the message payload. Fortunately, for two processes on the same node, the failure probability is correlated, meaning that coordinated recovery is free. In this paper, we propose an intermediate approach that uses coordination between correlated processes, but retains the scalability advantage of message logging between independent ones. The algorithm still belongs to the family of event logging protocols, but eliminates the need for costly payload logging between coordinated processes.
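The split described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Process` class, the `node` field, and the `send` helper are hypothetical names, and the sketch assumes that "correlated" simply means "hosted on the same node". Every delivery is recorded as an event, while the payload is logged only when the message crosses a node boundary.

```python
from dataclasses import dataclass, field


@dataclass
class Process:
    rank: int
    node: int  # hypothetical node id; same node is assumed to mean correlated failures
    event_log: list = field(default_factory=list)    # delivery events (always recorded)
    payload_log: list = field(default_factory=list)  # sender-side payload copies


def send(sender: Process, receiver: Process, payload: bytes) -> None:
    # The delivery event is recorded for every message, as in event logging protocols.
    receiver.event_log.append(("recv", sender.rank))
    # Assumption: processes on the same node fail and recover together (coordinated),
    # so the payload is kept only for messages between different nodes.
    if sender.node != receiver.node:
        sender.payload_log.append((receiver.rank, payload))


p0, p1, p2 = Process(0, node=0), Process(1, node=0), Process(2, node=1)
send(p0, p1, b"intra-node")  # event logged, payload not logged
send(p0, p2, b"inter-node")  # event logged, payload logged by the sender
```

In this toy model the payload log grows only with inter-node traffic, which is the overhead reduction the abstract claims for many-core nodes.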
Pages: 51-64
Number of pages: 14