Local rollback for resilient MPI applications with application-level checkpointing and message logging

Cited by: 20
Authors
Losada, Nuria [1]
Bosilca, George [2]
Bouteiller, Aurelien [2]
Gonzalez, Patricia [1]
Martin, Maria J. [1]
Affiliations
[1] Universidade da Coruña, Computer Architecture Group, A Coruña, Spain
[2] University of Tennessee, Innovative Computing Laboratory, Knoxville, TN, USA
Funding
U.S. National Science Foundation
Keywords
MPI; Resilience; Message logging; Application-level checkpointing; Local rollback; Fault-tolerant; Recovery
DOI
10.1016/j.future.2018.09.041
Chinese Library Classification number
TP301 [Theory, methods]
Discipline classification code
081202
Abstract
The resilience approach generally used in high-performance computing (HPC) relies on coordinated checkpoint/restart, a global rollback of all the processes running the application. In many instances, however, a failure has a more localized scope, and its impact is usually restricted to a subset of the resources in use. A global rollback then incurs unnecessary overhead and energy consumption, since all processes, including those unaffected by the failure, discard their state and roll back to the last checkpoint to repeat computations that were already done. The User Level Failure Mitigation (ULFM) interface, the latest proposal for including resilience features in the Message Passing Interface (MPI) standard, enables more flexible recovery strategies, including localized recovery. This work proposes a local rollback approach that can be generally applied to Single Program, Multiple Data (SPMD) applications by combining ULFM, the Compiler for Portable Checkpointing (CPPC) tool, and the Open MPI VProtocol system-level message logging component. Only failed processes are recovered from the last checkpoint, while consistency before further progress in the execution is achieved through a two-level message logging process. To further optimize this approach, point-to-point communications are logged by the Open MPI VProtocol component, while collective communications are optimally logged at the application level, thereby decoupling the logging protocol from the particular collective implementation. This spatially coordinated protocol applied by CPPC reduces the log size, the log memory requirements, and the overall impact of resilience on the applications. © 2018 Elsevier B.V. All rights reserved.
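The local rollback idea in the abstract can be illustrated with a toy, single-machine simulation. This is only a sketch under stated assumptions, not the paper's implementation: it does not use MPI, ULFM, CPPC, or VProtocol, it models point-to-point messages only (no collectives), and the names `Process`, `recover`, and `run` are hypothetical. Each sender keeps a log of what it sent; when one process fails, only that process is restored from its checkpoint, survivors replay their logged messages to it, and sequence numbers let receivers discard duplicates produced by re-execution.

```python
class Process:
    """Toy process: its application state is the sum of consumed payloads."""

    def __init__(self, rank):
        self.rank = rank
        self.state = 0
        self.sent = {}   # dest rank -> number of messages sent so far
        self.seen = {}   # src rank -> highest sequence number consumed
        self.log = {}    # dest rank -> [(seq, payload)], kept in sender memory
        self.saved = (0, {}, {})

    def checkpoint(self):
        # Save application state plus the send/receive counters.
        self.saved = (self.state, dict(self.sent), dict(self.seen))

    def send(self, dest, payload):
        seq = self.sent.get(dest.rank, 0) + 1
        self.sent[dest.rank] = seq
        self.log.setdefault(dest.rank, []).append((seq, payload))
        dest.receive(self.rank, seq, payload)

    def receive(self, src, seq, payload):
        if seq <= self.seen.get(src, 0):
            return  # already consumed: duplicate from replay or re-execution
        self.seen[src] = seq
        self.state += payload

    def restore(self):
        state, sent, seen = self.saved
        self.state, self.sent, self.seen = state, dict(sent), dict(seen)
        self.log = {}  # the in-memory log is lost with the failed process


def recover(failed, survivors):
    """Roll back only the failed process; survivors keep their state."""
    failed.restore()
    for peer in survivors:
        for seq, payload in peer.log.get(failed.rank, []):
            failed.receive(peer.rank, seq, payload)


def run(fail):
    p0, p1 = Process(0), Process(1)
    p0.send(p1, 5)
    p1.send(p0, 1)
    p0.checkpoint()
    p1.checkpoint()
    p0.send(p1, 7)
    p1.send(p0, 2)
    if fail:
        recover(p1, [p0])  # p1 fails; p0 does not roll back
        p1.send(p0, 2)     # re-executed send, discarded by p0 as a duplicate
    p0.send(p1, 11)
    return p0.state, p1.state


print(run(fail=False), run(fail=True))
```

In this sketch both runs end in the same state, which is the consistency property a local rollback protocol has to preserve. A real protocol must also handle log truncation at checkpoints, nondeterministic receive order, and collective operations, all of which are left out here.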
Pages: 450-464
Number of pages: 15