Parallel checkpoint/restart without message logging

Cited by: 9
Authors
Meth, KZ [1]
Tuel, WG [1]
Affiliations
[1] IBM Corp, Haifa Res Lab, Haifa, Israel
DOI
10.1109/ICPPW.2000.869110
Chinese Library Classification (CLC) number
TP301 [Theory, methods];
Discipline classification code
081202;
Abstract
We describe a parallel checkpoint/restart mechanism. The checkpoint is performed among the participating parallel tasks using a new algorithm that we call stop and discard. Tasks may be checkpointed without waiting for previously sent messages to be received. Specific message logging is not required. Message data that may be in transit is saved in the checkpoint files.
Pages: 253-258
Page count: 6
Related papers
50 in total
  • [21] Prediction of Energy Consumption by Checkpoint/Restart in HPC
    Moran, M.
    Balladini, J.
    Rexachs, D.
    Luque, E.
    IEEE ACCESS, 2019, 7 : 71791 - 71803
  • [22] Distributed Speculative Parallelization using Checkpoint Restart
    Ghoshal, Devarshi
    Ramkumar, Sreesudhan R.
    Chauhan, Arun
    PROCEEDINGS OF THE INTERNATIONAL CONFERENCE ON COMPUTATIONAL SCIENCE (ICCS), 2011, 4 : 422 - 431
  • [23] Message logging in mobile computing
    Yao, Bin
    Ssu, Kuo-Feng
    Kent Fuchs, W.
    Proceedings - Annual International Conference on Fault-Tolerant Computing, 1999, : 294 - 301
  • [24] Message logging in mobile computing
    Yao, B
    Ssu, KF
    Fuchs, WK
    TWENTY-NINTH ANNUAL INTERNATIONAL SYMPOSIUM ON FAULT-TOLERANT COMPUTING, DIGEST OF PAPERS, 1999, : 294 - 301
  • [25] Berkeley lab checkpoint/restart (BLCR) for Linux clusters
    Hargrove, Paul H.
    Duell, Jason C.
    SCIDAC 2006: SCIENTIFIC DISCOVERY THROUGH ADVANCED COMPUTING, 2006, 46 : 494 - 499
  • [26] A model for predicting the optimum checkpoint interval for restart dumps
    Daly, J
    COMPUTATIONAL SCIENCE - ICCS 2003, PT IV, PROCEEDINGS, 2003, 2660 : 3 - 12
  • [27] A SURVEY OF CHECKPOINT/RESTART TECHNIQUES ON DISTRIBUTED MEMORY SYSTEMS
    Shahzad, Faisal
    Wittmann, Markus
    Kreutzer, Moritz
    Zeiser, Thomas
    Hager, Georg
    Wellein, Gerhard
    PARALLEL PROCESSING LETTERS, 2013, 23 (04)
  • [28] DMTCP: Bringing interactive checkpoint-restart to Python
    Arya, Kapil
    Cooperman, Gene
    Computational Science and Discovery, 2015, 8 (01)
  • [29] Checkpoint and restart for distributed components in XCAT3
    Krishnan, S
    Gannon, D
    FIFTH IEEE/ACM INTERNATIONAL WORKSHOP ON GRID COMPUTING, PROCEEDINGS, 2004, : 281 - 288
  • [30] Virtualization aware job schedulers for checkpoint-restart
    Badrinath, R.
    Krishnakumar, R.
    Rajan, R. K. Palanivel
    2007 INTERNATIONAL CONFERENCE ON PARALLEL AND DISTRIBUTED SYSTEMS, VOLS 1 AND 2, 2007, : 876 - 882