Asynchronous Receiver-Driven Replay for Local Rollback of MPI Applications

Cited by: 3
Authors
Losada, Nuria [1]
Bouteiller, Aurelien [1]
Bosilca, George [1]
Affiliation
[1] Univ Tennessee, Innovat Comp Lab, Knoxville, TN 37996 USA
Funding
National Science Foundation (USA);
Keywords
fault tolerance; MPI; User Level Fault Mitigation; ULFM; message logging; checkpoint/restart; MESSAGE; PERFORMANCE; RECOVERY;
DOI
10.1109/FTXS49593.2019.00006
Chinese Library Classification
TP301 [Theory, Methods];
Subject classification code
081202;
Abstract
With the increase in scale and architectural complexity of supercomputers, the management of failures has become integral to successfully executing a long-running high-performance computing application. In many instances, failures have a localized scope, usually impacting a subset of the resources being used, yet widely used failure recovery strategies (like checkpoint/restart) fail to take advantage of this locality and rely on global, synchronous recovery actions. Even with local rollback recovery, in which only the fault-impacted processes are restarted from a checkpoint, the consistency of further progress in the execution is achieved through the replay of communication from a message log. This theoretically sound approach encounters some practical limitations: the presence of collective operations forces a synchronous recovery that prevents survivor processes from continuing their execution, removing any possibility for overlapping further computation with the recovery; and the amount of resources required at recovering peers can be untenable. In this work, we solve both problems by implementing an asynchronous, receiver-driven replay of point-to-point and collective communications, and by exploiting remote-memory access capabilities to access the message logs. This new protocol is evaluated in an implementation of local rollback over the User Level Failure Mitigation fault-tolerant Message Passing Interface (MPI). It reduces the recovery times of the failed processes by an average of 59%, while the time spent in the recovery by the survivor processes is reduced by 95% when compared to an equivalent global rollback protocol, thus living up to the promise of a truly localized impact of recovery actions.
Pages: 1-10
Page count: 10