Local rollback for resilient MPI applications with application-level checkpointing and message logging

Cited by: 20
Authors
Losada, Nuria [1]
Bosilca, George [2]
Bouteiller, Aurelien [2]
Gonzalez, Patricia [1]
Martin, Maria J. [1]
Affiliations
[1] Univ A Coruna, Comp Architecture Grp, Coruna, Spain
[2] Univ Tennessee, Innovat Comp Lab, Knoxville, TN USA
Funding
US National Science Foundation
Keywords
MPI; Resilience; Message logging; Application-level checkpointing; Local rollback; FAULT-TOLERANT; RECOVERY;
DOI
10.1016/j.future.2018.09.041
Chinese Library Classification (CLC) number
TP301 [Theory, Methods]
Subject classification code
081202
Abstract
The resilience approach generally used in high-performance computing (HPC) relies on coordinated checkpoint/restart, a global rollback of all the processes that are running the application. However, in many instances the failure has a more localized scope, and its impact is usually restricted to a subset of the resources being used. A global rollback therefore results in unnecessary overhead and energy consumption, since all processes, including those unaffected by the failure, discard their state and roll back to the last checkpoint to repeat computations that were already done. The User Level Failure Mitigation (ULFM) interface, the latest proposal for the inclusion of resilience features in the Message Passing Interface (MPI) standard, enables the deployment of more flexible recovery strategies, including localized recovery. This work proposes a local rollback approach that can be generally applied to Single Program, Multiple Data (SPMD) applications by combining ULFM, the Compiler for Portable Checkpointing (CPPC) tool, and the Open MPI VProtocol system-level message logging component. Only failed processes are recovered from the last checkpoint, while consistency before further progress in the execution is achieved through a two-level message logging process. To further optimize this approach, point-to-point communications are logged by the Open MPI VProtocol component, while collective communications are optimally logged at the application level, thereby decoupling the logging protocol from the particular collective implementation. This spatially coordinated protocol applied by CPPC reduces the log size, the log memory requirements, and the overall impact of resilience on the applications. © 2018 Elsevier B.V. All rights reserved.
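The local rollback idea described in the abstract can be illustrated with a small, single-process simulation. This is only a sketch under stated assumptions, not the paper's implementation: it does not use MPI, ULFM, CPPC, or VProtocol, and the names `Proc`, `update`, and `run` are hypothetical. It assumes a ring of ranks that exchange one value per step, sender-side logs kept in memory, and coordinated checkpoints that allow older log entries to be discarded. When one rank fails, only that rank restores its checkpoint; it replays its receives from the log of its surviving neighbor, and its regenerated sends are logged again but not delivered.

```python
# Toy simulation of local rollback with sender-based message logging.
# All names here are hypothetical; no real MPI library is involved.

def update(value, incoming):
    # Deterministic "computation" that depends on the received message.
    return (value * 31 + incoming) % 1000003


class Proc:
    def __init__(self, rank):
        self.rank = rank
        self.step = 0
        self.value = rank
        self.sent_log = {}            # step -> payload sent at that step
        self.checkpoint = (0, rank)   # (step, value) of the last checkpoint

    def take_checkpoint(self):
        self.checkpoint = (self.step, self.value)
        # Assumption: checkpoints are coordinated, so no rank can roll back
        # past this point and older log entries are no longer needed.
        self.sent_log.clear()

    def restore(self):
        self.step, self.value = self.checkpoint


def run(nprocs, steps, ckpt_every, fail=None):
    """Run the ring exchange; fail=(rank, step) kills `rank` before `step`.

    Returns the final values and the number of re-executed process-steps.
    """
    procs = [Proc(r) for r in range(nprocs)]
    reexecuted = 0
    for step in range(steps):
        if fail is not None and fail[1] == step:
            failed = procs[fail[0]]
            failed.value = None       # volatile state is lost ...
            failed.sent_log = {}      # ... including the in-memory log
            failed.restore()          # local rollback: only this rank
            left = procs[(failed.rank - 1) % nprocs]
            while failed.step < step:
                # The regenerated send is re-logged but NOT delivered,
                # because the survivor already consumed it.
                failed.sent_log[failed.step] = failed.value
                # The receive is replayed from the survivor's log.
                incoming = left.sent_log[failed.step]
                failed.value = update(failed.value, incoming)
                failed.step += 1
                reexecuted += 1
        # Normal step: every rank sends to its right neighbor and logs it.
        outgoing = [p.value for p in procs]
        for p in procs:
            p.sent_log[step] = outgoing[p.rank]
        for p in procs:
            p.value = update(p.value, outgoing[(p.rank - 1) % nprocs])
            p.step += 1
        if (step + 1) % ckpt_every == 0:
            for p in procs:
                p.take_checkpoint()
    return [p.value for p in procs], reexecuted


if __name__ == "__main__":
    clean, _ = run(4, 10, 3)
    recovered, redone = run(4, 10, 3, fail=(2, 8))
    print(recovered == clean, redone)
```

In this toy setting the failed rank repeats only the steps since its last checkpoint, whereas a global rollback would make all four ranks repeat them. Clearing the logs at each coordinated checkpoint is a simplified stand-in for the log size reduction the abstract attributes to the spatially coordinated protocol.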
Pages: 450 - 464
Page count: 15
Related papers
50 in total
  • [1] Resilient MPI applications using an application-level checkpointing framework and ULFM
    Losada, Nuria
    Cores, Ivan
    Martin, Maria J.
    Gonzalez, Patricia
    JOURNAL OF SUPERCOMPUTING, 2017, 73 (01) : 100 - 113
  • [2] Resilient MPI applications using an application-level checkpointing framework and ULFM
    Nuria Losada
    Iván Cores
    María J. Martín
    Patricia González
    The Journal of Supercomputing, 2017, 73 : 100 - 113
  • [3] Automated application-level checkpointing of MPI programs
    Bronevetsky, G
    Marques, D
    Pingali, K
    Stodghill, P
    ACM SIGPLAN NOTICES, 2003, 38 (10) : 84 - 94
  • [4] Portable Application-level Checkpointing for Hybrid MPI-OpenMP Applications
    Losada, Nuria
    Martin, Maria J.
    Rodriguez, Gabriel
    Gonzalez, Patricia
    INTERNATIONAL CONFERENCE ON COMPUTATIONAL SCIENCE 2016 (ICCS 2016), 2016, 80 : 19 - 29
  • [5] Insights into application-level solutions towards resilient MPI applications
    Gonzalez, Patricia
    Losada, Nuria
    Martin, Maria J.
    PROCEEDINGS 2018 INTERNATIONAL CONFERENCE ON HIGH PERFORMANCE COMPUTING & SIMULATION (HPCS), 2018 : 610 - 613
  • [6] Static analysis for application-level checkpointing of MPI programs
    Wang, Panfeng
    Du, Yunfei
    Fu, Hongyi
    Yang, Xuejun
    Zhou, Haifang
    HPCC 2008: 10TH IEEE INTERNATIONAL CONFERENCE ON HIGH PERFORMANCE COMPUTING AND COMMUNICATIONS, PROCEEDINGS, 2008 : 548 - 555
  • [7] Compiler-Assisted Application-Level Checkpointing for MPI Programs
    Yang, Xuejun
    Wang, Panfeng
    Fu, Hongyi
    Du, Yunfei
    Wang, Zhiyuan
    Jia, Jia
    28TH INTERNATIONAL CONFERENCE ON DISTRIBUTED COMPUTING SYSTEMS, VOLS 1 AND 2, PROCEEDINGS, 2008 : 251 - 259
  • [8] Checkpointing RSIP applications at application-level in ChinaGrid
    Li, CJ
    Yang, XJ
    Xiao, N
    Current Trends in High Performance Computing and Its Applications, Proceedings, 2005 : 351 - 356
  • [9] C3: A system for automating application-level checkpointing of MPI programs
    Bronevetsky, G
    Marques, D
    Pingali, K
    Stodghill, P
    LANGUAGES AND COMPILERS FOR PARALLEL COMPUTING, 2004, 2958 : 357 - 373
  • [10] An Application-Level Solution for the Dynamic Reconfiguration of MPI Applications
    Cores, Ivan
    Gonzalez, Patricia
    Jeannot, Emmanuel
    Martin, Maria J.
    Rodriguez, Gabriel
    HIGH PERFORMANCE COMPUTING FOR COMPUTATIONAL SCIENCE - VECPAR 2016, 2017, 10150 : 191 - 205