C3: A system for automating application-level checkpointing of MPI programs

Cited by: 2
Authors
Bronevetsky, G [1]
Marques, D [1]
Pingali, K [1]
Stodghill, P [1]
Affiliations
[1] Cornell Univ, Dept Comp Sci, Ithaca, NY 14853 USA
Keywords
Application level - Check pointing - Checkpointing techniques - Coordination protocols - Equivalent faults - High-performance platforms - MPI applications - Program variables
DOI
10.1007/978-3-540-24644-2_23
Chinese Library Classification number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
Fault tolerance is becoming necessary on high-performance platforms. Checkpointing techniques make programs fault-tolerant by saving their state periodically and restoring this state after failure. System-level checkpointing saves the state of the entire machine on stable storage, but this usually has too much overhead. In practice, programmers do manual application-level checkpointing by writing code to (i) save the values of key program variables at critical points in the program, and (ii) restore the entire computational state from these values during recovery. However, this can be difficult to do in general MPI programs. In ([1],[2]) we have presented a distributed checkpoint coordination protocol which handles MPI's point-to-point and collective constructs, while dealing with the unique challenges of application-level checkpointing. We have implemented our protocols as part of a thin software layer that sits between the application program and the MPI library, so it does not require any modifications to the MPI library. This thin layer is used by C3 (the Cornell Checkpoint (pre-)Compiler), a tool that automatically converts an MPI application into an equivalent fault-tolerant version. In this paper, we summarize our work on this system to date. We also present experimental results showing that the overhead introduced by the protocols is small, and we discuss a number of future areas of research.
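The manual save/restore pattern described in the abstract can be illustrated with a minimal single-process sketch. It is not the C3 system or its coordination protocol, and it involves no MPI: the function names (`save_checkpoint`, `load_checkpoint`, `run`), the `pickle` file format, and the `fail_at` parameter used to simulate a crash are all assumptions made for illustration.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write the key program variables to disk (hypothetical helper)."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    # Rename over the old file so a crash never leaves a half-written checkpoint.
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the last saved state, or None if no checkpoint exists."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)


def run(path, n_steps, checkpoint_every=10, fail_at=None):
    """Sum 0..n_steps-1, checkpointing the loop counter and running total."""
    # (ii) Restore the computational state from saved values, if any.
    state = load_checkpoint(path) or {"step": 0, "total": 0}
    while state["step"] < n_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated failure")  # illustrative crash
        state["total"] += state["step"]
        state["step"] += 1
        # (i) Save key variables at a chosen point in the program.
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    return state["total"]
```

Calling `run` again with the same path after a simulated failure resumes from the most recent checkpoint rather than from step 0. In a real MPI program each process would also have to agree on a consistent set of checkpoints, which is the part the coordination protocol in the abstract addresses and this sketch does not attempt.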
Pages: 357-373
Number of pages: 17
Related papers
50 in total
  • [21] Performance evaluation of an application-level checkpointing solution on grids
    Rodriguez, Gabriel
    Pardo, Xoan C.
    Martin, Maria J.
    Gonzalez, Patricia
    FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF ESCIENCE, 2010, 26 (07): 1012-1023
  • [22] A technique for non-invasive application-level checkpointing
    Ritu Arora
    Purushotham Bangalore
    Marjan Mernik
    The Journal of Supercomputing, 2011, 57: 227-255
  • [23] Runtime Interval Optimization and Dependable Performance for Application-Level Checkpointing
    Kokolis, Apostolos
    Mavrogiannis, Alexandros
    Rodopoulos, Dimitrios
    Strydis, Christos
    Soudris, Dimitrios
    PROCEEDINGS OF THE 2016 DESIGN, AUTOMATION & TEST IN EUROPE CONFERENCE & EXHIBITION (DATE), 2016: 594-599
  • [24] Parallel I/O Performance for Application-Level Checkpointing on the Blue Gene/P System
    Fu, Jing
    Min, Misun
    Latham, Robert
    Carothers, Christopher D.
    2011 IEEE INTERNATIONAL CONFERENCE ON CLUSTER COMPUTING (CLUSTER), 2011: 465-473
  • [25] An Application-Level Solution for the Dynamic Reconfiguration of MPI Applications
    Cores, Ivan
    Gonzalez, Patricia
    Jeannot, Emmanuel
    Martin, Maria J.
    Rodriguez, Gabriel
    HIGH PERFORMANCE COMPUTING FOR COMPUTATIONAL SCIENCE - VECPAR 2016, 2017, 10150: 191-205
  • [26] Application-Level Differential Checkpointing for HPC Applications with Dynamic Datasets
    Keller, Kai
    Gomez, Leonardo Bautista
    2019 19TH IEEE/ACM INTERNATIONAL SYMPOSIUM ON CLUSTER, CLOUD AND GRID COMPUTING (CCGRID), 2019: 52-61
  • [27] Reducing the overhead of an MPI application-level migration approach
    Cores, Ivan
    Rodriguez, Monica
    Gonzalez, Patricia
    Martin, Maria J.
    PARALLEL COMPUTING, 2016, 54: 72-82
  • [28] An Application-Level Incremental Checkpointing Mechanism with Automatic Parameter Tuning
    Takizawa, Hiroyuki
    Amrizal, Muhammad Alfian
    Komatsu, Kazuhiko
    Egawa, Ryusuke
    2017 FIFTH INTERNATIONAL SYMPOSIUM ON COMPUTING AND NETWORKING (CANDAR), 2017: 389-394
  • [29] Failure Avoidance in MPI Applications Using an Application-Level Approach
    Cores, Ivan
    Rodriguez, Gabriel
    Gonzalez, Patricia
    Martin, Maria J.
    COMPUTER JOURNAL, 2014, 57 (01): 100-114
  • [30] iCheck: Leveraging RDMA and Malleability for Application-Level Checkpointing in HPC Systems
    John, Jophin
    Araya, Isaac David Nunez
    Gerndt, Michael
    2022 IEEE 28TH INTERNATIONAL CONFERENCE ON PARALLEL AND DISTRIBUTED SYSTEMS, ICPADS, 2022: 467-474