Local rollback for resilient MPI applications with application-level checkpointing and message logging

Cited by: 20
Authors
Losada, Nuria [1]
Bosilca, George [2]
Bouteiller, Aurelien [2]
Gonzalez, Patricia [1]
Martin, Maria J. [1]
Affiliations
[1] Univ A Coruna, Comp Architecture Grp, Coruna, Spain
[2] Univ Tennessee, Innovat Comp Lab, Knoxville, TN USA
Funding
US National Science Foundation
Keywords
MPI; Resilience; Message logging; Application-level checkpointing; Local rollback; FAULT-TOLERANT; RECOVERY
DOI
10.1016/j.future.2018.09.041
Chinese Library Classification number
TP301 [Theory, methods]
Discipline classification code
081202
Abstract
The resilience approach generally used in high-performance computing (HPC) relies on coordinated checkpoint/restart, a global rollback of all the processes that are running the application. However, in many instances a failure has a more localized scope, and its impact is usually restricted to a subset of the resources being used. A global rollback therefore results in unnecessary overhead and energy consumption, since all processes, including those unaffected by the failure, discard their state and roll back to the last checkpoint to repeat computations that were already done. The User Level Failure Mitigation (ULFM) interface, the most recent proposal for including resilience features in the Message Passing Interface (MPI) standard, enables the deployment of more flexible recovery strategies, including localized recovery. This work proposes a local rollback approach that can be generally applied to Single Program, Multiple Data (SPMD) applications by combining ULFM, the Compiler for Portable Checkpointing (CPPC) tool, and the Open MPI VProtocol system-level message logging component. Only failed processes are recovered from the last checkpoint, while consistency before further progress in the execution is achieved through a two-level message logging process. To further optimize this approach, point-to-point communications are logged by the Open MPI VProtocol component, while collective communications are optimally logged at the application level, thereby decoupling the logging protocol from the particular collective implementation. The spatially coordinated protocol applied by CPPC reduces the log size, the log memory requirements and, overall, the impact of resilience on the applications. (C) 2018 Elsevier B.V. All rights reserved.
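The local rollback idea in the abstract can be illustrated with a toy, single-process Python simulation. This is only a sketch under stated assumptions, not the authors' implementation: it does not use MPI, ULFM, CPPC or VProtocol, and every name in it (`Proc`, `recover`, `run`, `demo`) is hypothetical. Each simulated rank keeps a sender-side log of the messages it sends; after a failure, only the failed rank restores its checkpoint, survivors replay the logged messages it lost, and survivors discard the duplicates it re-sends. The application state is a commutative sum, so replay order does not matter here, whereas a real protocol also has to preserve delivery order.

```python
class Proc:
    """One simulated rank (hypothetical toy model, not an MPI process)."""

    def __init__(self, rank):
        self.rank = rank
        self.total = 0        # application state: sum of delivered payloads
        self.next_seq = {}    # dest rank -> next sequence number to send
        self.delivered = {}   # src rank -> number of messages delivered
        self.send_log = []    # sender-side log of (dest, seq, payload)
        self.saved = None     # last checkpoint

    def checkpoint(self):
        self.saved = (self.total, dict(self.next_seq),
                      dict(self.delivered), list(self.send_log))

    def restore(self):
        total, next_seq, delivered, send_log = self.saved
        self.total = total
        self.next_seq = dict(next_seq)
        self.delivered = dict(delivered)
        self.send_log = list(send_log)

    def send(self, dest, payload):
        seq = self.next_seq.get(dest.rank, 0)
        self.next_seq[dest.rank] = seq + 1
        self.send_log.append((dest.rank, seq, payload))  # log before delivery
        dest.deliver(self.rank, seq, payload)

    def deliver(self, src, seq, payload):
        if seq < self.delivered.get(src, 0):
            return  # duplicate re-sent by a recovering rank: drop it
        self.delivered[src] = seq + 1
        self.total += payload


def run(procs, steps):
    for src, dst, payload in steps:
        procs[src].send(procs[dst], payload)


def recover(failed, procs, steps_since_ckpt):
    """Roll back only `failed`; survivors keep their current state."""
    failed.restore()
    # Survivors replay, from their logs, the messages the failed rank lost.
    for p in procs:
        if p is failed:
            continue
        for dest, seq, payload in p.send_log:
            if dest == failed.rank and seq >= failed.delivered.get(p.rank, 0):
                failed.deliver(p.rank, seq, payload)
    # The failed rank re-executes its own sends; receivers drop duplicates.
    for src, dst, payload in steps_since_ckpt:
        if src == failed.rank:
            failed.send(procs[dst], payload)


def demo():
    steps = [(0, 1, 5), (1, 0, 7), (2, 1, 1), (1, 2, 3), (0, 1, 2), (2, 0, 4)]

    reference = [Proc(r) for r in range(3)]   # failure-free execution
    run(reference, steps)

    procs = [Proc(r) for r in range(3)]
    run(procs, steps[:2])
    for p in procs:
        p.checkpoint()
    run(procs, steps[2:5])
    recover(procs[1], procs, steps[2:5])      # rank 1 fails at this point
    run(procs, steps[5:])

    return [p.total for p in reference], [p.total for p in procs]
```

Calling `demo()` returns the per-rank totals of the failure-free run and of the run with a local rollback of rank 1; in this toy model the two lists match, which is the consistency property the logging is meant to provide.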
Pages: 450-464
Page count: 15