Asynchronous Receiver-Driven Replay for Local Rollback of MPI Applications

Cited by: 3
Authors
Losada, Nuria [1]
Bouteiller, Aurelien [1]
Bosilca, George [1]
Affiliations
[1] Univ Tennessee, Innovat Comp Lab, Knoxville, TN 37996 USA
Funding
National Science Foundation (United States);
Keywords
fault tolerance; MPI; User Level Fault Mitigation; ULFM; message logging; checkpoint/restart; MESSAGE; PERFORMANCE; RECOVERY;
DOI
10.1109/FTXS49593.2019.00006
Chinese Library Classification number
TP301 [Theory, methods];
Discipline classification code
081202 ;
Abstract
With the increase in scale and architectural complexity of supercomputers, the management of failures has become integral to successfully executing a long-running high-performance computing application. In many instances, failures have a localized scope, usually impacting a subset of the resources being used, yet widely used failure recovery strategies (like checkpoint/restart) fail to take advantage of this locality and rely on global, synchronous recovery actions. Even with local rollback recovery, in which only the fault-impacted processes are restarted from a checkpoint, the consistency of further progress in the execution is achieved through the replay of communication from a message log. This theoretically sound approach encounters some practical limitations: the presence of collective operations forces a synchronous recovery that prevents survivor processes from continuing their execution, removing any possibility for overlapping further computation with the recovery; and the amount of resources required at recovering peers can be untenable. In this work, we solve both problems by implementing an asynchronous, receiver-driven replay of point-to-point and collective communications, and by exploiting remote-memory access capabilities to access the message logs. This new protocol is evaluated in an implementation of local rollback over the User Level Failure Mitigation fault-tolerant Message Passing Interface (MPI). It reduces the recovery times of the failed processes by an average of 59%, while the time spent in the recovery by the survivor processes is reduced by 95% when compared to an equivalent global rollback protocol, thus living up to the promise of a truly localized impact of recovery actions.
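The receiver-driven idea in the abstract can be illustrated with a toy, single-process simulation. This is only a sketch under stated assumptions, not the authors' protocol: it uses no MPI or ULFM calls, and the names `SenderLog`, `Process`, and `replay_from` are hypothetical. The `read` method stands in for a one-sided remote-memory get, and the call to `survivor.compute()` inside the replay loop merely models the survivor making progress while the restarted process pulls its missing messages.

```python
from dataclasses import dataclass, field


@dataclass
class SenderLog:
    """Sender-side message log keyed by destination rank (hypothetical layout)."""
    entries: dict = field(default_factory=dict)  # dest rank -> list of payloads

    def record(self, dest, payload):
        self.entries.setdefault(dest, []).append(payload)

    def read(self, dest, seq):
        # Stand-in for a one-sided get: the log owner runs no protocol code here.
        log = self.entries.get(dest, [])
        return log[seq] if seq < len(log) else None


@dataclass
class Process:
    rank: int
    log: SenderLog = field(default_factory=SenderLog)
    delivered: list = field(default_factory=list)
    work_done: int = 0

    def send(self, peer, payload):
        self.log.record(peer.rank, payload)  # log the message, then deliver it
        peer.delivered.append(payload)

    def compute(self):
        self.work_done += 1


def replay_from(restarted, survivor, start_seq):
    """Receiver-driven replay: the restarted process pulls logged messages itself."""
    seq = start_seq
    while (payload := survivor.log.read(restarted.rank, seq)) is not None:
        restarted.delivered.append(payload)
        seq += 1
        survivor.compute()  # models the survivor progressing during the replay
    return seq


survivor, victim = Process(0), Process(1)
for step in range(5):
    survivor.send(victim, f"msg-{step}")

# Assumed failure scenario: rank 1 restarts from a checkpoint taken after 2 deliveries.
restarted = Process(1, delivered=victim.delivered[:2])
next_seq = replay_from(restarted, survivor, start_seq=2)
```

In this sketch the survivor never executes a replay routine of its own; the restarted process decides what to fetch and when, which is the property the abstract attributes to a receiver-driven design.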
Pages: 1-10
Page count: 10