Asynchronous Receiver-Driven Replay for Local Rollback of MPI Applications

Cited by: 3
Authors
Losada, Nuria [1]
Bouteiller, Aurelien [1]
Bosilca, George [1]
Affiliations
[1] University of Tennessee, Innovative Computing Laboratory, Knoxville, TN 37996 USA
Funding
National Science Foundation (USA)
Keywords
fault tolerance; MPI; User Level Fault Mitigation; ULFM; message logging; checkpoint/restart; MESSAGE; PERFORMANCE; RECOVERY;
DOI
10.1109/FTXS49593.2019.00006
Chinese Library Classification
TP301 [Theory, Methods]
Subject classification code
081202
Abstract
With the increase in scale and architectural complexity of supercomputers, the management of failures has become integral to successfully executing a long-running high-performance computing application. In many instances, failures have a localized scope, usually impacting only a subset of the resources being used, yet widely used failure recovery strategies (like checkpoint/restart) fail to take advantage of this locality and rely on global, synchronous recovery actions. Even with local rollback recovery, in which only the fault-impacted processes are restarted from a checkpoint, the consistency of further progress in the execution is achieved through the replay of communication from a message log. This theoretically sound approach encounters some practical limitations: the presence of collective operations forces a synchronous recovery that prevents survivor processes from continuing their execution, removing any possibility of overlapping further computation with the recovery; and the amount of resources required at recovering peers can be untenable. In this work, we solved both problems by implementing an asynchronous, receiver-driven replay of point-to-point and collective communications, and by exploiting remote-memory access capabilities to access the message logs. This new protocol is evaluated in an implementation of local rollback over the User Level Failure Mitigation fault-tolerant Message Passing Interface (MPI). It reduces the recovery times of the failed processes by an average of 59%, while the time spent in recovery by the survivor processes is reduced by 95% when compared to an equivalent global rollback protocol, thus living up to the promise of a truly localized impact of recovery actions.
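The idea of a receiver-driven replay can be illustrated with a toy, single-process model. This is only a sketch under stated assumptions and not the authors' implementation: it uses no MPI or ULFM calls, and the names `SenderLog`, `Process`, `replay_from_log`, and `run_demo` are hypothetical. A plain method call stands in for a one-sided remote-memory read of the log, and an interleaved `compute()` step stands in for the survivor continuing its own work while the restarted process catches up.

```python
from dataclasses import dataclass, field


@dataclass
class SenderLog:
    """Hypothetical sender-side log: payloads kept per destination, in send order."""
    entries: dict = field(default_factory=dict)  # destination rank -> list of payloads

    def record(self, dest, payload):
        self.entries.setdefault(dest, []).append(payload)

    def read(self, dest, seq):
        # Stand-in for a one-sided read: in this model the log owner
        # runs no replay logic on behalf of the reader.
        return self.entries[dest][seq]


@dataclass
class Process:
    rank: int
    log: SenderLog = field(default_factory=SenderLog)
    received: list = field(default_factory=list)
    steps_done: int = 0

    def send(self, peer, payload):
        self.log.record(peer.rank, payload)  # log the message, then deliver it
        peer.received.append(payload)

    def compute(self):
        self.steps_done += 1


def replay_from_log(restarted, survivor, checkpoint_seq):
    """The restarted process pulls every message logged after its checkpoint."""
    logged = len(survivor.log.entries.get(restarted.rank, []))
    for seq in range(checkpoint_seq, logged):
        restarted.received.append(survivor.log.read(restarted.rank, seq))
        # Single-threaded toy: one compute step per replayed message models
        # the survivor making progress instead of waiting for the replay.
        survivor.compute()


def run_demo():
    survivor, victim = Process(0), Process(1)
    for i in range(5):
        survivor.send(victim, f"msg{i}")
    # Simulated failure: rank 1 restarts from a checkpoint taken after 2 receptions.
    victim = Process(1, received=["msg0", "msg1"])
    replay_from_log(victim, survivor, checkpoint_seq=2)
    return victim.received, survivor.steps_done
```

In this model the restarted process decides what to fetch and when, which is the sense in which the replay is receiver-driven; the real protocol additionally has to handle collective communications and message ordering, which the sketch leaves out.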
Pages: 1-10
Page count: 10