Parallel checkpoint/restart without message logging

Cited by: 9
Authors
Meth, KZ [1]
Tuel, WG [1]
Affiliations
[1] IBM Corp, Haifa Res Lab, Haifa, Israel
DOI
10.1109/ICPPW.2000.869110
Chinese Library Classification (CLC) number
TP301 [Theory, Methods];
Discipline classification code
081202;
Abstract
We describe a parallel checkpoint/restart mechanism. The checkpoint is performed among the participating parallel tasks using a new algorithm that we call stop and discard. Tasks may be checkpointed without waiting for previously sent messages to be received. Specific message logging is not required. Message data that may be in transit is saved in the checkpoint files.
Pages: 253-258
Page count: 6
Related papers
50 in total
  • [11] Efficient checkpoint/Restart of CUDA applications
    Nukada, Akira
    Suzuki, Taichiro
    Matsuoka, Satoshi
    PARALLEL COMPUTING, 2023, 116
  • [12] Message fragment based causal message logging
    Ci, Yi-Wei
    Zhang, Zhan
    Zuo, De-Cheng
    Yang, Xiao-Zong
    JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING, 2009, 69 (11) : 915 - 921
  • [13] Hybrid Message Pessimistic Logging. Improving current pessimistic message logging protocols
    Meyer, Hugo
    Muresano, Ronal
    Castro-Leon, Marcela
    Rexachs, Dolores
    Luque, Emilio
    JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING, 2017, 104 : 206 - 222
  • [14] A Flexible Checkpoint/Restart Model in Distributed Systems
    Bouguerra, Mohamed-Slim
    Gautier, Thierry
    Trystram, Denis
    Vincent, Jean-Marc
    PARALLEL PROCESSING AND APPLIED MATHEMATICS, PT I, 2010, 6067 : 206 - +
  • [15] Checkpoint/Restart in Practice: When 'Simple is Better'
    El-Sayed, Nosayba
    Schroeder, Bianca
    2014 IEEE INTERNATIONAL CONFERENCE ON CLUSTER COMPUTING (CLUSTER), 2014, : 84 - 92
  • [16] CheCUDA: A Checkpoint/Restart Tool for CUDA Applications
    Takizawa, Hiroyuki
    Sato, Katsuto
    Komatsu, Kazuhiko
    Kobayashi, Hiroaki
    2009 INTERNATIONAL CONFERENCE ON PARALLEL AND DISTRIBUTED COMPUTING, APPLICATIONS AND TECHNOLOGIES (PDCAT 2009), 2009, : 408 - +
  • [17] Interconnect Agnostic Checkpoint/Restart in Open MPI
    Hursey, Joshua
    Mattox, Timothy I.
    Lumsdaine, Andrew
    HPDC'09: 18TH ACM INTERNATIONAL SYMPOSIUM ON HIGH PERFORMANCE DISTRIBUTED COMPUTING, 2009, : 49 - 58
  • [18] Checkpoint and Restart: An Energy Consumption Characterization in Clusters
    Moran, Marina
    Balladini, Javier
    Rexachs, Dolores
    Luque, Emilio
    COMPUTER SCIENCE - CACIC 2018, 2019, 995 : 19 - 33
  • [19] Checkpoint Restart Support for Heterogeneous HPC Applications
    Parasyris, Konstantinos
    Keller, Kai
    Bautista-Gomez, Leonardo
    Unsal, Osman
    2020 20TH IEEE/ACM INTERNATIONAL SYMPOSIUM ON CLUSTER, CLOUD AND INTERNET COMPUTING (CCGRID 2020), 2020, : 242 - 251
  • [20] Checkpoint-Restart for a Network of Virtual Machines
    Garg, Rohan
    Sodha, Komal
    Jin, Zhengping
    Cooperman, Gene
    2013 IEEE INTERNATIONAL CONFERENCE ON CLUSTER COMPUTING (CLUSTER), 2013,