C3: A system for automating application-level checkpointing of MPI programs

Cited by: 2
Authors
Bronevetsky, G [1]
Marques, D [1]
Pingali, K [1]
Stodghill, P [1]
Affiliation
[1] Cornell Univ, Dept Comp Sci, Ithaca, NY 14853 USA
Keywords
Application level - Check pointing - Checkpointing techniques - Coordination protocols - Equivalent faults - High-performance platforms - MPI applications - Program variables
DOI
10.1007/978-3-540-24644-2_23
Chinese Library Classification number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
Fault tolerance is becoming necessary on high-performance platforms. Checkpointing techniques make programs fault-tolerant by saving their state periodically and restoring this state after a failure. System-level checkpointing saves the state of the entire machine on stable storage, but this usually has too much overhead. In practice, programmers do manual application-level checkpointing by writing code to (i) save the values of key program variables at critical points in the program, and (ii) restore the entire computational state from these values during recovery. However, this can be difficult to do in general MPI programs. In [1, 2], we presented a distributed checkpoint coordination protocol that handles MPI's point-to-point and collective constructs while dealing with the unique challenges of application-level checkpointing. We have implemented our protocols as part of a thin software layer that sits between the application program and the MPI library, so no modifications to the MPI library are required. This thin layer is used by C3 (the Cornell Checkpoint (pre-)Compiler), a tool that automatically converts an MPI application into an equivalent fault-tolerant version. In this paper, we summarize our work on this system to date. We also present experimental results showing that the overhead introduced by the protocols is small, and we discuss a number of areas for future research.
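The manual, application-level approach described in the abstract (save key variables at chosen points, restore from them on recovery) can be illustrated with a minimal single-process sketch. This is not the C3 system or its coordination protocol, and it involves no MPI. The function names `save_checkpoint`, `load_checkpoint`, and `run`, the `fail_at` parameter, and the use of `pickle` for storage are all assumptions made for illustration.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write the key program variables to disk (hypothetical helper)."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    # Rename over the old checkpoint so a crash during the write
    # cannot leave a half-written file under the final name.
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or None if no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)


def run(path, total_steps, checkpoint_every, fail_at=None):
    """Sum the integers 0..total_steps-1, checkpointing periodically.

    fail_at is an illustrative knob that simulates a crash at a given step.
    """
    # Recovery: resume from the last checkpoint if one exists.
    state = load_checkpoint(path) or {"step": 0, "acc": 0}
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated failure")
        state["acc"] += state["step"]
        state["step"] += 1
        # Checkpoint only at chosen points, not after every step.
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    return state["acc"]
```

Calling `run` again with the same path after a failure resumes from the last saved step rather than from zero. In an MPI program the hard part, which this sketch leaves out entirely, is coordinating such checkpoints across processes so that in-flight messages are handled consistently.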
Pages: 357-373
Page count: 17
Related papers
50 in total
  • [1] Automated application-level checkpointing of MPI programs
    Bronevetsky, G
    Marques, D
    Pingali, K
    Stodghill, P
    ACM SIGPLAN NOTICES, 2003, 38 (10) : 84 - 94
  • [2] Static analysis for application-level checkpointing of MPI programs
    Wang, Panfeng
    Du, Yunfei
    Fu, Hongyi
    Yang, Xuejun
    Zhou, Haifang
    HPCC 2008: 10TH IEEE INTERNATIONAL CONFERENCE ON HIGH PERFORMANCE COMPUTING AND COMMUNICATIONS, PROCEEDINGS, 2008, : 548 - 555
  • [3] Compiler-Assisted Application-Level Checkpointing for MPI Programs
    Yang, Xuejun
    Wang, Panfeng
    Fu, Hongyi
    Du, Yunfei
    Wang, Zhiyuan
    Jia, Jia
    28TH INTERNATIONAL CONFERENCE ON DISTRIBUTED COMPUTING SYSTEMS, VOLS 1 AND 2, PROCEEDINGS, 2008, : 251 - 259
  • [4] Automated Application-Level Checkpointing Based on Live-variable Analysis in MPI Programs
    Wang, Panfeng
    Yang, Xuejun
    Fu, Hongyi
    Du, Yunfei
    Wang, Zhiyuan
    Jia, Jia
    PPOPP'08: PROCEEDINGS OF THE 2008 ACM SIGPLAN SYMPOSIUM ON PRINCIPLES AND PRACTICE OF PARALLEL PROGRAMMING, 2008, : 273 - 274
  • [5] WBC-ALC: A Weak Blocking Coordinated Application-Level Checkpointing for MPI Programs
    Xu, Xinhai
    Yang, Xuejun
    Lin, Yufei
    IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS, 2012, E95D (03): : 786 - 796
  • [6] Application-level checkpointing for shared memory programs
    Bronevetsky, G
    Marques, D
    Pingali, K
    Szwed, P
    Schulz, M
    ACM SIGPLAN NOTICES, 2004, 39 (11) : 235 - 247
  • [7] Application-level checkpointing techniques for parallel programs
    Walters, John Paul
    Chaudhary, Vipin
    DISTRIBUTED COMPUTING AND INTERNET TECHNOLOGY, PROCEEDINGS, 2006, 4317 : 221 - +
  • [8] Portable Application-level Checkpointing for Hybrid MPI-OpenMP Applications
    Losada, Nuria
    Martin, Maria J.
    Rodriguez, Gabriel
    Gonzalez, Patricia
    INTERNATIONAL CONFERENCE ON COMPUTATIONAL SCIENCE 2016 (ICCS 2016), 2016, 80 : 19 - 29
  • [9] Resilient MPI applications using an application-level checkpointing framework and ULFM
    Losada, Nuria
    Cores, Ivan
    Martin, Maria J.
    Gonzalez, Patricia
    JOURNAL OF SUPERCOMPUTING, 2017, 73 (01): : 100 - 113
  • [10] Resilient MPI applications using an application-level checkpointing framework and ULFM
    Nuria Losada
    Iván Cores
    María J. Martín
    Patricia González
    The Journal of Supercomputing, 2017, 73 : 100 - 113