Correlated Set Coordination in Fault Tolerant Message Logging Protocols

Cited by: 0

Authors:

Bouteiller, Aurelien ^{[1]}

Herault, Thomas ^{[1]}

Bosilca, George ^{[1]}

Dongarra, Jack J. ^{[1]}

Affiliations:

[1] Univ Tennessee, Innovat Comp Lab, Knoxville, TN 37996 USA

Source:

EURO-PAR 2011 PARALLEL PROCESSING, PT 2 | 2011 / Vol. 6853

Keywords:

DOI:

Not available

Chinese Library Classification number:

TP301 [Theory, Methods];

Discipline classification code:

081202 ;

Abstract:

Based on our current expectations for exascale systems, composed of hundreds of thousands of many-core nodes, the mean time between failures will become small, even under the most optimistic assumptions. Message logging, one of the most scalable checkpoint/restart techniques, is the most challenged when the number of cores per node increases, because of the high overhead of saving the message payload. Fortunately, for two processes on the same node, the failure probability is correlated, meaning that coordinated recovery is free. In this paper, we propose an intermediate approach that uses coordination between correlated processes but retains the scalability advantage of message logging between independent ones. The algorithm still belongs to the family of event logging protocols, but eliminates the need for costly payload logging between coordinated processes.
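The payload-logging decision described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Sender` class, the `node_of` mapping, and the choice to treat "same node" as the correlated set and to record one event per message are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Sender:
    """Hypothetical sender-side logger for a single process."""
    rank: int
    node_of: dict                     # rank -> node id (assumed correlated set)
    event_log: list = field(default_factory=list)    # (src, dst, seq)
    payload_log: list = field(default_factory=list)  # (dst, seq, payload)
    seq: int = 0

    def send(self, dst: int, payload: bytes) -> bool:
        """Record the send; return True if the payload was logged."""
        self.seq += 1
        # Assumption: an event record is kept for every message.
        self.event_log.append((self.rank, dst, self.seq))
        # Processes on the same node are assumed to fail and recover
        # together, so only messages crossing nodes keep their payload.
        crosses_sets = self.node_of[self.rank] != self.node_of[dst]
        if crosses_sets:
            self.payload_log.append((dst, self.seq, payload))
        return crosses_sets


# Ranks 0 and 1 share node 0; rank 2 is on node 1.
sender = Sender(rank=0, node_of={0: 0, 1: 0, 2: 1})
sender.send(1, b"intra-node")   # event only
sender.send(2, b"inter-node")   # event and payload
```

In this sketch the payload log grows only with inter-node traffic, which is the saving the abstract attributes to coordinating correlated processes.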


Pages: 51-64

Page count: 14

50 results in total

[1] Correlated set coordination in fault tolerant message logging protocols for many-core clusters
Bouteiller, Aurelien
Herault, Thomas
Bosilca, George
Dongarra, Jack J.
CONCURRENCY AND COMPUTATION-PRACTICE & EXPERIENCE, 2013, 25 (04): : 572 - 585
[2] A lightweight message logging scheme for fault tolerant MPI
Lee, I
Yeom, HY
Park, T
Park, H
PARALLEL PROCESSING AND APPLIED MATHEMATICS, 2004, 3019 : 397 - 404
[3] Reducing the Overhead of Message Logging in Fault-Tolerant HPC Applications
Meneses, Esteban
HIGH PERFORMANCE COMPUTING CARLA 2016, 2017, 697 : 204 - 218
[4] SmartPath implementation of coordination protocols for a fault tolerant AHS design
Lindsey, AE
PROCEEDINGS OF THE 1997 AMERICAN CONTROL CONFERENCE, VOLS 1-6, 1997, : 2464 - 2468
[5] Dynamic Load Balance for Optimized Message Logging in Fault Tolerant HPC Applications
Meneses, Esteban
Kale, Laxmikant V.
Bronevetsky, Greg
2011 IEEE INTERNATIONAL CONFERENCE ON CLUSTER COMPUTING (CLUSTER), 2011, : 281 - 289
[6] Improved message logging versus Improved coordinated checkpointing for fault tolerant MPI
Lemarinier, P
Bouteiller, A
Herault, T
Krawezik, G
Cappello, F
2004 IEEE INTERNATIONAL CONFERENCE ON CLUSTER COMPUTING, 2004, : 115 - 124
[7] MESSAGE COMPLEXITY VERSUS SPACE COMPLEXITY IN FAULT TOLERANT BROADCAST PROTOCOLS
MORAN, S
NETWORKS, 1989, 19 (05) : 505 - 519
[8] The cost of recovery in message logging protocols
Rao, SR
Alvisi, L
Vin, HM
SEVENTEENTH IEEE SYMPOSIUM ON RELIABLE DISTRIBUTED SYSTEMS, PROCEEDINGS, 1998, : 10 - 18
[9] The cost of recovery in message logging protocols
Rao, S
Alvisi, L
Vin, HM
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2000, 12 (02) : 160 - 173
[10] Hybrid Message Pessimistic Logging. Improving current pessimistic message logging protocols
Meyer, Hugo
Muresano, Ronal
Castro-Leon, Marcela
Rexachs, Dolores
Luque, Emilio
JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING, 2017, 104 : 206 - 222
