Parallel checkpoint/restart without message logging

Cited by: 9
Authors
Meth, KZ [1]
Tuel, WG [1]
Affiliations
[1] IBM Corp, Haifa Res Lab, Haifa, Israel
DOI
10.1109/ICPPW.2000.869110
Chinese Library Classification number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
We describe a parallel checkpoint/restart mechanism. The checkpoint is performed among the participating parallel tasks using a new algorithm that we call stop and discard. Tasks may be checkpointed without waiting for previously sent messages to be received. Specific message logging is not required. Message data that may be in transit is saved in the checkpoint files.
Pages: 253-258
Page count: 6