Application-level checkpointing for shared memory programs

Cited by: 28
Authors
Bronevetsky, G [1]
Marques, D [1]
Pingali, K [1]
Szwed, P [1]
Schulz, M [1]
Affiliation
[1] Cornell Univ, Dept Comp Sci, Ithaca, NY 14853 USA
Keywords
fault-tolerance; checkpointing; shared-memory; programs; OpenMP;
DOI
10.1145/1037187.1024421
Chinese Library Classification number
TP31 [Computer Software];
Discipline classification code
081202; 0835;
Abstract
Trends in high-performance computing are making it necessary for long-running applications to tolerate hardware faults. The most commonly used approach is checkpoint and restart (CPR): the state of the computation is saved periodically on disk, and when a failure occurs, the computation is restarted from the last saved state. At present, it is the responsibility of the programmer to instrument applications for CPR. Our group is investigating the use of compiler technology to instrument codes to make them self-checkpointing and self-restarting, thereby providing an automatic solution to the problem of making long-running scientific applications resilient to hardware faults. Our previous work focused on message-passing programs. In this paper, we describe such a system for shared-memory programs running on symmetric multiprocessors. This system has two components: (i) a pre-compiler for source-to-source modification of applications, and (ii) a run-time system that implements a protocol for coordinating CPR among the threads of the parallel application. For the sake of concreteness, we focus on a non-trivial subset of OpenMP that includes barriers and locks. One of the advantages of this approach is that the ability to tolerate faults becomes embedded within the application itself, so applications become self-checkpointing and self-restarting on any platform. We demonstrate this by showing that our transformed benchmarks can checkpoint and restart on three different platforms (Windows/x86, Linux/x86, and Tru64/Alpha). Our experiments show that the overhead introduced by this approach is usually quite small; they also suggest ways in which the current implementation can be tuned to reduce overheads further.
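The general idea of coordinated checkpoint and restart among threads can be illustrated with a minimal sketch. This is not the protocol described in the paper, whose details are not given in the abstract; it is a simplified blocking scheme, written in Python rather than OpenMP, in which all threads meet at a barrier and the shared state is saved once before any thread proceeds. The function name `run`, its parameters, the `crash_at` failure simulation, and the checkpoint file layout are all hypothetical.

```python
import os
import pickle
import tempfile
import threading


def run(total_steps, num_threads, ckpt_path, every, crash_at=None):
    """Toy computation with coordinated checkpoint/restart (illustrative only).

    Each thread adds the step number to its own counter. Every `every` steps
    all threads meet at a barrier and the shared state is written to disk.
    `crash_at` (hypothetical) simulates a failure: all threads stop at that step.
    """
    # Restart: resume from the last saved state if a checkpoint exists.
    if os.path.exists(ckpt_path):
        with open(ckpt_path, "rb") as f:
            state = pickle.load(f)
    else:
        state = {"step": 0, "counters": [0] * num_threads}

    def save():
        # Runs in exactly one thread while the others are still blocked at
        # the barrier, so the saved state is consistent across threads.
        tmp = ckpt_path + ".tmp"
        with open(tmp, "wb") as f:
            pickle.dump(state, f)
        os.replace(tmp, ckpt_path)  # swap in the new checkpoint atomically

    barrier = threading.Barrier(num_threads, action=save)
    start = state["step"]

    def worker(tid):
        for step in range(start + 1, total_steps + 1):
            if crash_at is not None and step == crash_at:
                return  # simulated failure; unsaved work is lost
            state["counters"][tid] += step
            if step % every == 0:
                if tid == 0:
                    state["step"] = step  # record progress before saving
                barrier.wait()  # checkpoint is taken here

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "ckpt.pkl")
    run(10, 4, path, every=5, crash_at=8)  # fails after the step-5 checkpoint
    print(run(10, 4, path, every=5))       # restarts from step 5 and finishes
```

Work done between the last checkpoint and the failure is recomputed on restart, which is the basic cost trade-off that the checkpoint interval controls.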
Pages: 235 - 247
Number of pages: 13
Related papers
50 in total
  • [21] In-memory application-level checkpoint-based migration for MPI programs
    Cores, Ivan
    Rodriguez, Gabriel
    Martin, Maria J.
    Gonzalez, Patricia
    [J]. JOURNAL OF SUPERCOMPUTING, 2014, 70 (02): : 660 - 670
  • [22] iCheck: Leveraging RDMA and Malleability for Application-Level Checkpointing in HPC Systems
    John, Jophin
    Araya, Isaac David Nunez
    Gerndt, Michael
    [J]. 2022 IEEE 28TH INTERNATIONAL CONFERENCE ON PARALLEL AND DISTRIBUTED SYSTEMS, ICPADS, 2022, : 467 - 474
  • [23] An Application-Level Approach for Privacy-preserving Virtual Machine Checkpointing
    Hu, Yaohui
    Li, Tianlin
    Yang, Ping
    Gopalan, Kartik
    [J]. 2013 IEEE SIXTH INTERNATIONAL CONFERENCE ON CLOUD COMPUTING (CLOUD 2013), 2013, : 59 - 66
  • [24] Resilient MPI applications using an application-level checkpointing framework and ULFM
    Nuria Losada
    Iván Cores
    María J. Martín
    Patricia González
    [J]. The Journal of Supercomputing, 2017, 73 : 100 - 113
  • [25] Resilient MPI applications using an application-level checkpointing framework and ULFM
    Losada, Nuria
    Cores, Ivan
    Martin, Maria J.
    Gonzalez, Patricia
    [J]. JOURNAL OF SUPERCOMPUTING, 2017, 73 (01): : 100 - 113
  • [26] Application-level memory optimization for MPSoC
    Girodias, B.
    Bouchebaba, Y.
    Nicolescu, G.
    Aboulhamid, E. M.
    Paulin, P.
    Lavigueur, B.
    [J]. SEVENTEENTH IEEE INTERNATIONAL WORKSHOP ON RAPID SYSTEM PROTOTYPING, 2006, : 169 - +
  • [27] Local rollback for resilient MPI applications with application-level checkpointing and message logging
    Losada, Nuria
    Bosilca, George
    Bouteiller, Aurelien
    Gonzalez, Patricia
    Martin, Maria J.
    [J]. FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF ESCIENCE, 2019, 91 : 450 - 464
  • [28] Checkpointing distributed shared memory
    Silva, LM
    Silva, JG
    [J]. JOURNAL OF SUPERCOMPUTING, 1997, 11 (02): : 137 - 158
  • [29] SimSnap: Fast-forwarding via native execution and application-level checkpointing
    Szwed, PK
    Marques, D
    Buels, RM
    McKee, SA
    Schulz, M
    [J]. EIGHTH WORKSHOP ON INTERACTION BETWEEN COMPILERS AND COMPUTER ARCHITECTURES, PROCEEDINGS, 2004, : 65 - 74
  • [30] Checkpointing Distributed Shared Memory
    Luis M. Silva
    João Gabriel Silva
    [J]. The Journal of Supercomputing, 1997, 11 : 137 - 158