Application-level checkpointing for shared memory programs

Cited by: 28
Authors
Bronevetsky, G [1 ]
Marques, D [1 ]
Pingali, K [1 ]
Szwed, P [1 ]
Schulz, M [1 ]
Affiliations
[1] Cornell Univ, Dept Comp Sci, Ithaca, NY 14853 USA
Keywords
fault-tolerance; checkpointing; shared-memory programs; OpenMP
DOI
10.1145/1037187.1024421
Chinese Library Classification (CLC) number
TP31 [Computer Software]
Discipline classification code
081202; 0835
Abstract
Trends in high-performance computing are making it necessary for long-running applications to tolerate hardware faults. The most commonly used approach is checkpoint and restart (CPR): the state of the computation is saved periodically on disk, and when a failure occurs, the computation is restarted from the last saved state. At present, it is the responsibility of the programmer to instrument applications for CPR. Our group is investigating the use of compiler technology to instrument codes to make them self-checkpointing and self-restarting, thereby providing an automatic solution to the problem of making long-running scientific applications resilient to hardware faults. Our previous work focused on message-passing programs. In this paper, we describe such a system for shared-memory programs running on symmetric multiprocessors. This system has two components: (i) a pre-compiler for source-to-source modification of applications, and (ii) a run-time system that implements a protocol for coordinating CPR among the threads of the parallel application. For the sake of concreteness, we focus on a non-trivial subset of OpenMP that includes barriers and locks. One of the advantages of this approach is that the ability to tolerate faults becomes embedded within the application itself, so applications become self-checkpointing and self-restarting on any platform. We demonstrate this by showing that our transformed benchmarks can checkpoint and restart on three different platforms (Windows/x86, Linux/x86, and Tru64/Alpha). Our experiments show that the overhead introduced by this approach is usually quite small; they also suggest ways in which the current implementation can be tuned to reduce overheads further.
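The checkpoint-and-restart idea in the abstract (save the computation state periodically, resume from the last saved state after a failure) can be illustrated with a minimal single-threaded sketch. This is not the system described in the paper, which uses a pre-compiler and a run-time protocol to coordinate OpenMP threads; the function names, the pickle-based state format, and the `fail_at` fault-injection parameter are all assumptions made for illustration.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write `state` to `path` via a temp file plus rename, so a crash
    during the write never leaves a half-written checkpoint behind."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the last saved state, or None if no checkpoint exists."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)


def run(path, n_steps, interval, fail_at=None):
    """Sum the integers 0..n_steps-1, checkpointing every `interval` steps.

    `fail_at` is a hypothetical fault-injection hook: the run raises
    when it reaches that step, standing in for a hardware fault.
    """
    # Resume from the last checkpoint if one exists, else start fresh.
    state = load_checkpoint(path) or {"step": 0, "total": 0}
    while state["step"] < n_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated hardware fault")
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % interval == 0:
            save_checkpoint(path, state)
    return state["total"]
```

Calling `run` again with the same `path` after a failure resumes from the most recent checkpoint instead of from step 0. A shared-memory version would additionally need all threads to agree on a consistent point at which to save, which is the role the abstract assigns to the run-time coordination protocol.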
Pages: 235-247
Number of pages: 13
Related papers
50 in total
  • [1] Automated application-level checkpointing of MPI programs
    Bronevetsky, G
    Marques, D
    Pingali, K
    Stodghill, P
    [J]. ACM SIGPLAN NOTICES, 2003, 38 (10) : 84 - 94
  • [2] Application-level checkpointing techniques for parallel programs
    Walters, John Paul
    Chaudhary, Vipin
    [J]. DISTRIBUTED COMPUTING AND INTERNET TECHNOLOGY, PROCEEDINGS, 2006, 4317 : 221 - +
  • [3] Static analysis for application-level checkpointing of MPI programs
    Wang, Panfeng
    Du, Yunfei
    Fu, Hongyi
    Yang, Xuejun
    Zhou, Haifang
    [J]. HPCC 2008: 10TH IEEE INTERNATIONAL CONFERENCE ON HIGH PERFORMANCE COMPUTING AND COMMUNICATIONS, PROCEEDINGS, 2008, : 548 - 555
  • [4] Compiler-Assisted Application-Level Checkpointing for MPI Programs
    Yang, Xuejun
    Wang, Panfeng
    Fu, Hongyi
    Du, Yunfei
    Wang, Zhiyuan
    Jia, Jia
    [J]. 28TH INTERNATIONAL CONFERENCE ON DISTRIBUTED COMPUTING SYSTEMS, VOLS 1 AND 2, PROCEEDINGS, 2008, : 251 - 259
  • [5] C3: A system for automating application-level checkpointing of MPI programs
    Bronevetsky, G
    Marques, D
    Pingali, K
    Stodghill, P
    [J]. LANGUAGES AND COMPILERS FOR PARALLEL COMPUTING, 2004, 2958 : 357 - 373
  • [6] ITALC: Interactive Tool for Application-Level Checkpointing
    Arora, Ritu
    Trung Nguyen Ba
    [J]. HUST'17: PROCEEDINGS OF THE FOURTH INTERNATIONAL WORKSHOP ON HPC USER SUPPORT TOOLS, 2017,
  • [7] Checkpointing RSIP applications at application-level in ChinaGrid
    Li, CJ
    Yang, XJ
    Xiao, N
    [J]. Current Trends in High Performance Computing and Its Applications, Proceedings, 2005, : 351 - 356
  • [8] Automated Application-Level Checkpointing Based on Live-variable Analysis in MPI Programs
    Wang, Panfeng
    Yang, Xuejun
    Fu, Hongyi
    Du, Yunfei
    Wang, Zhiyuan
    Jia, Jia
    [J]. PPOPP'08: PROCEEDINGS OF THE 2008 ACM SIGPLAN SYMPOSIUM ON PRINCIPLES AND PRACTICE OF PARALLEL PROGRAMMING, 2008, : 273 - 274
  • [9] WBC-ALC: A Weak Blocking Coordinated Application-Level Checkpointing for MPI Programs
    Xu, Xinhai
    Yang, Xuejun
    Lin, Yufei
    [J]. IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS, 2012, E95D (03): : 786 - 796
  • [10] System-Level vs. Application-Level Checkpointing
    Posner, Jonas
    [J]. 2020 IEEE INTERNATIONAL CONFERENCE ON CLUSTER COMPUTING (CLUSTER 2020), 2020, : 404 - 405