Local rollback for resilient MPI applications with application-level checkpointing and message logging

Cited by: 20
Authors
Losada, Nuria [1]
Bosilca, George [2]
Bouteiller, Aurelien [2]
Gonzalez, Patricia [1]
Martin, Maria J. [1]
Affiliations
[1] Univ A Coruna, Comp Architecture Grp, Coruna, Spain
[2] Univ Tennessee, Innovat Comp Lab, Knoxville, TN USA
Funding
US National Science Foundation
Keywords
MPI; Resilience; Message logging; Application-level checkpointing; Local rollback; FAULT-TOLERANT; RECOVERY;
DOI
10.1016/j.future.2018.09.041
Chinese Library Classification
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
The resilience approach generally used in high-performance computing (HPC) relies on coordinated checkpoint/restart, a global rollback of all the processes that are running the application. However, in many instances, the failure has a more localized scope and its impact is usually restricted to a subset of the resources being used. Thus, a global rollback would result in unnecessary overhead and energy consumption, since all processes, including those unaffected by the failure, discard their state and roll back to the last checkpoint to repeat computations that were already done. The User Level Failure Mitigation (ULFM) interface, the latest proposal for the inclusion of resilience features in the Message Passing Interface (MPI) standard, enables the deployment of more flexible recovery strategies, including localized recovery. This work proposes a local rollback approach that can be generally applied to Single Program, Multiple Data (SPMD) applications by combining ULFM, the Compiler for Portable Checkpointing (CPPC) tool, and the Open MPI VProtocol system-level message logging component. Only failed processes are recovered from the last checkpoint, while consistency before further progress in the execution is achieved through a two-level message logging process. To further optimize this approach, point-to-point communications are logged by the Open MPI VProtocol component, while collective communications are optimally logged at the application level, thereby decoupling the logging protocol from the particular collective implementation. This spatially coordinated protocol applied by CPPC reduces the log size, the log memory requirements and, overall, the impact of resilience on the applications. © 2018 Elsevier B.V. All rights reserved.
Pages: 450-464
Number of pages: 15