Parallel checkpoint/restart without message logging

Cited by: 9

Authors:

Meth, KZ [1]

Tuel, WG [1]

Affiliations:

[1] IBM Corp, Haifa Res Lab, Haifa, Israel

Source:

2000 INTERNATIONAL WORKSHOPS ON PARALLEL PROCESSING, PROCEEDINGS | 2000

Keywords:

DOI:

10.1109/ICPPW.2000.869110

Chinese Library Classification (CLC) number:

TP301 [Theory, Methods];

Discipline classification code:

081202;

Abstract:

We describe a parallel checkpoint/restart mechanism. The checkpoint is performed among the participating parallel tasks using a new algorithm that we call stop and discard. Tasks may be checkpointed without waiting for previously sent messages to be received. Specific message logging is not required. Message data that may be in transit is saved in the checkpoint files.


Pages: 253-258

Page count: 6

50 items in total

[1] Checkpoint/Restart-Enabled Parallel Debugging
Hursey, Joshua
January, Chris
O'Connor, Mark
Hargrove, Paul H.
Lecomber, David
Squyres, Jeffrey M.
Lumsdaine, Andrew
RECENT ADVANCES IN THE MESSAGE PASSING INTERFACE, 2010, 6305 : 219 - +
[2] Hierarchical composition of coordinated checkpoint with pessimistic message logging
Ndiaye, Ndeye Massata
Sens, Pierre
Thiare, Ousmane
2012 2ND IEEE INTERNATIONAL CONFERENCE ON PARALLEL, DISTRIBUTED AND GRID COMPUTING (PDGC), 2012, : 752 - 756
[3] Replaying distributed programs without message logging
Netzer, RHB
Xu, YK
SIXTH IEEE INTERNATIONAL SYMPOSIUM ON HIGH PERFORMANCE DISTRIBUTED COMPUTING, PROCEEDINGS, 1997, : 137 - 147
[4] FREM: A Fast Restart Mechanism for General Checkpoint/Restart
Li, Yawei
Lan, Zhiling
IEEE TRANSACTIONS ON COMPUTERS, 2011, 60 (05) : 639 - 652
[5] Combining Coordinated and Uncoordinated Checkpoint in Pessimistic Sender-Based Message Logging
Aminian, Mehdi
Akbari, Mohammad K.
Javadi, Bahman
INTERNATIONAL JOURNAL OF COMPUTER SCIENCE AND NETWORK SECURITY, 2006, 6 (04) : 156 - 161
[6] An optimistic checkpointing and message logging approach for consistent global checkpoint collection in distributed systems
Jiang, Qiangfeng
Luo, Yi
Manivannan, D.
JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING, 2008, 68 (12) : 1575 - 1589
[7] Scalable group-based checkpoint/restart for large-scale message-passing systems
Ho, Justin C. Y.
Wang, Cho-Li
Lau, Francis C. M.
2008 IEEE INTERNATIONAL SYMPOSIUM ON PARALLEL & DISTRIBUTED PROCESSING, VOLS 1-8, 2008, : 1749 - 1760
[8] Optimizing Checkpoint Restart with Data Deduplication
Chen, Zhengyu
Sun, Jianhua
Chen, Hao
SCIENTIFIC PROGRAMMING, 2016, 2016
[9] Microservice Debugging with Checkpoint-Restart
Merino, Xavier
Otero, Carlos E.
2023 IEEE CLOUD SUMMIT, 2023, : 58 - 63
[10] Affinity-Aware Checkpoint Restart
Saini, Ajay
Rezaei, Arash
Mueller, Frank
Hargrove, Paul
Roman, Eric
ACM/IFIP/USENIX MIDDLEWARE 2014, 2014, : 121 - 132
