Parallel checkpoint/restart without message logging

Cited by: 9
Authors
Meth, KZ [1]
Tuel, WG [1]
Affiliation
[1] IBM Corp, Haifa Res Lab, Haifa, Israel
DOI
10.1109/ICPPW.2000.869110
Chinese Library Classification
TP301 [Theory, methods];
Discipline classification code
081202;
Abstract
We describe a parallel checkpoint/restart mechanism. The checkpoint is performed among the participating parallel tasks using a new algorithm that we call stop and discard. Tasks may be checkpointed without waiting for previously sent messages to be received. Specific message logging is not required. Message data that may be in transit is saved in the checkpoint files.
Pages: 253-258
Page count: 6