Parallel checkpoint/restart without message logging

Cited by: 9
Authors
Meth, KZ [1]
Tuel, WG [1]
Affiliations
[1] IBM Corp, Haifa Res Lab, Haifa, Israel
DOI
10.1109/ICPPW.2000.869110
Chinese Library Classification number
TP301 [Theory, Methods];
Discipline classification code
081202;
Abstract
We describe a parallel checkpoint/restart mechanism. The checkpoint is performed among the participating parallel tasks using a new algorithm that we call stop and discard. Tasks may be checkpointed without waiting for previously sent messages to be received. Specific message logging is not required. Message data that may be in transit is saved in the checkpoint files.
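The abstract gives no algorithmic detail, so the following is only a toy illustration of the three claims it makes, not the authors' implementation. It assumes (hypothetically) that each task keeps sent-but-unacknowledged messages in its own send buffer, that a stopped task discards anything arriving at it, and that restart resends the saved buffer. All names (`Task`, `deliver_all`, `checkpoint`, `restart`) are invented for this sketch.

```python
import copy
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Task:
    """Hypothetical task: application state plus a buffer of unacknowledged sends."""
    name: str
    received: list = field(default_factory=list)
    seen: set = field(default_factory=set)        # (sender, seq) pairs already accepted
    unacked: dict = field(default_factory=dict)   # seq -> (dest, payload)
    next_seq: int = 0
    stopped: bool = False

    def send(self, network, dest, payload):
        seq = self.next_seq
        self.next_seq += 1
        self.unacked[seq] = (dest, payload)       # kept until acknowledged
        network.append((self.name, dest, seq, payload))


def deliver_all(network, tasks):
    """Deliver queued messages; a stopped receiver discards them unacknowledged."""
    while network:
        src, dest, seq, payload = network.popleft()
        receiver = tasks[dest]
        if receiver.stopped:
            continue                              # "discard": no delivery, no ack
        if (src, seq) not in receiver.seen:       # drop duplicates from resends
            receiver.seen.add((src, seq))
            receiver.received.append(payload)
        tasks[src].unacked.pop(seq, None)         # acknowledgement reaches the sender


def checkpoint(tasks, network):
    """Stop every task, save its state, then let in-flight messages be discarded."""
    for task in tasks.values():
        task.stopped = True                       # "stop": no waiting for receipt
    saved = copy.deepcopy(tasks)                  # state includes the unacked buffers
    deliver_all(network, tasks)                   # everything in flight is dropped
    return saved


def restart(saved):
    """Restore tasks from the checkpoint and resend every unacknowledged message."""
    tasks = copy.deepcopy(saved)
    network = deque()
    for task in tasks.values():
        task.stopped = False
        for seq, (dest, payload) in sorted(task.unacked.items()):
            network.append((task.name, dest, seq, payload))
    return tasks, network


def demo():
    tasks = {"A": Task("A"), "B": Task("B")}
    network = deque()
    tasks["A"].send(network, "B", "m1")
    deliver_all(network, tasks)                   # m1 is delivered and acknowledged
    tasks["A"].send(network, "B", "m2")           # m2 is still in transit
    saved = checkpoint(tasks, network)            # taken without waiting for m2
    restored, resend_queue = restart(saved)
    deliver_all(resend_queue, restored)
    return saved["B"].received, restored["B"].received
```

In this sketch the in-transit message survives only because it sits in the sender's checkpointed send buffer; no separate message log is kept, which is the property the abstract highlights.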
Pages: 253-258
Number of pages: 6
Related papers
50 in total
  • [41] The Message Logging System for NOvA Experiment
    Lu, Qiming
    Kowalkowski, J. B.
    Biery, K. A.
    INTERNATIONAL CONFERENCE ON COMPUTING IN HIGH ENERGY AND NUCLEAR PHYSICS (CHEP 2010), 2011, 331
  • [42] Checkpoint Interval and System's Overall Quality for Message Logging-based Rollback and Recovery in Distributed and Embedded Computing
    Chen, Nianen
    Yu, Yue
    Ren, Shangping
    2009 INTERNATIONAL CONFERENCE ON EMBEDDED SOFTWARE AND SYSTEMS, PROCEEDINGS, 2009, : 315 - +
  • [43] CRState: checkpoint/restart of OpenCL program for in-kernel applications
    Chen, Genlang
    Zhang, Jiajian
    Zhu, Zufang
    Jiang, Qiangqiang
    Jiang, Hai
    Pang, Chaoyi
    JOURNAL OF SUPERCOMPUTING, 2021, 77 (06): : 5426 - 5467
  • [44] Combining XOR and Partner Checkpointing for Resilient Multilevel Checkpoint/Restart
    Gholami, Masoud
    Schintke, Florian
    2021 IEEE 35TH INTERNATIONAL PARALLEL AND DISTRIBUTED PROCESSING SYMPOSIUM (IPDPS), 2021, : 277 - 288
  • [45] CRState: checkpoint/restart of OpenCL program for in-kernel applications
    Genlang Chen
    Jiajian Zhang
    Zufang Zhu
    Qiangqiang Jiang
    Hai Jiang
    Chaoyi Pang
    The Journal of Supercomputing, 2021, 77 : 5426 - 5467
  • [46] Docker Container Deployment in Distributed Fog Infrastructures with Checkpoint/Restart
    Ahmed, Arif
    Mohan, Apoorve
    Cooperman, Gene
    Pierre, Guillaume
    2020 8TH IEEE INTERNATIONAL CONFERENCE ON MOBILE CLOUD COMPUTING, SERVICES, AND ENGINEERING (MOBILE CLOUD 2020), 2020, : 55 - 62
  • [47] Exploration of Lossy Compression for Application-level Checkpoint/Restart
    Sasaki, Naoto
    Sato, Kento
    Endo, Toshio
    Matsuoka, Satoshi
    2015 IEEE 29TH INTERNATIONAL PARALLEL AND DISTRIBUTED PROCESSING SYMPOSIUM (IPDPS), 2015, : 914 - 922
  • [48] A Fast Restart Mechanism for Checkpoint/Recovery Protocols in Networked Environments
    Li, Yawei
    Lan, Zhiling
    2008 IEEE INTERNATIONAL CONFERENCE ON DEPENDABLE SYSTEMS & NETWORKS WITH FTCS & DCC, 2008, : 217 - 226
  • [49] Checkpoint/restart approaches for a thread-based MPI runtime
    Adam, Julien
    Kermarquer, Maxime
    Besnard, Jean-Baptiste
    Bautista-Gomez, Leonardo
    Perache, Marc
    Carribault, Patrick
    Jaeger, Julien
    Malony, Allen D.
    Shende, Sameer
    PARALLEL COMPUTING, 2019, 85 : 204 - 219
  • [50] Availability in parallel systems: Automatic process restart
    Bowen, NS
    Antognini, J
    Regan, RD
    Matsakis, NC
    IBM SYSTEMS JOURNAL, 1997, 36 (02) : 284 - 300