Asynchronous Receiver-Driven Replay for Local Rollback of MPI Applications

Cited by: 3
Authors
Losada, Nuria [1 ]
Bouteiller, Aurelien [1 ]
Bosilca, George [1 ]
Affiliations
[1] Univ Tennessee, Innovat Comp Lab, Knoxville, TN 37996 USA
Funding
U.S. National Science Foundation
Keywords
fault tolerance; MPI; User Level Fault Mitigation; ULFM; message logging; checkpoint/restart; MESSAGE; PERFORMANCE; RECOVERY;
DOI
10.1109/FTXS49593.2019.00006
Chinese Library Classification
TP301 [Theory, Methods];
Discipline classification code
081202 ;
Abstract
With the increase in scale and architectural complexity of supercomputers, the management of failures has become integral to successfully executing a long-running high-performance computing application. In many instances, failures have a localized scope, usually impacting a subset of the resources being used, yet widely used failure recovery strategies (like checkpoint/restart) fail to take advantage of this locality and rely on global, synchronous recovery actions. Even with local rollback recovery, in which only the fault-impacted processes are restarted from a checkpoint, the consistency of further progress in the execution is achieved through the replay of communication from a message log. This theoretically sound approach encounters some practical limitations: the presence of collective operations forces a synchronous recovery that prevents survivor processes from continuing their execution, removing any possibility for overlapping further computation with the recovery; and the amount of resources required at recovering peers can be untenable. In this work, we solve both problems by implementing an asynchronous, receiver-driven replay of point-to-point and collective communications, and by exploiting remote-memory access capabilities to access the message logs. This new protocol is evaluated in an implementation of local rollback over the User Level Failure Mitigation fault-tolerant Message Passing Interface (MPI). It reduces the recovery times of the failed processes by an average of 59%, while the time spent in the recovery by the survivor processes is reduced by 95% when compared to an equivalent global rollback protocol, thus living up to the promise of a truly localized impact of recovery actions.
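The receiver-driven idea in the abstract can be illustrated with a toy model. The sketch below is not the authors' implementation and uses no MPI or ULFM calls: the names `Survivor`, `remote_read`, and `replay` are hypothetical, the sender-side log is a plain dictionary, and the one-sided (remote-memory) access is modeled as a direct read of the survivor's log. It only shows the shape of the approach, assuming a sender-based message log: the restarted process pulls the messages it needs, in its original receive order, while survivors run no replay code and remain free to keep computing.

```python
from dataclasses import dataclass, field


@dataclass
class Survivor:
    """A process that did not fail; it logs every payload it sends (hypothetical model)."""
    rank: int
    # Assumed sender-side log: destination rank -> payloads in send order.
    log: dict = field(default_factory=dict)
    steps_done: int = 0

    def send(self, dest, payload):
        # Record the payload so it can be replayed if `dest` later fails.
        self.log.setdefault(dest, []).append(payload)

    def compute(self):
        # Stand-in for application work that continues during recovery.
        self.steps_done += 1


def remote_read(survivor, dest, seq):
    """Models a one-sided read of a single log entry; the survivor executes nothing."""
    return survivor.log[dest][seq]


def replay(restarted_rank, survivors, receive_order):
    """Receiver-driven replay: the restarted rank pulls each logged message it
    needs, following the order in which it originally received them."""
    by_rank = {s.rank: s for s in survivors}
    next_seq = {s.rank: 0 for s in survivors}
    replayed = []
    for src in receive_order:
        replayed.append(remote_read(by_rank[src], restarted_rank, next_seq[src]))
        next_seq[src] += 1
    return replayed
```

In this model, two survivors that each logged messages to rank 2 can be replayed with `replay(2, [s0, s1], receive_order=[0, 1, 0])`; the survivors' `compute` calls are independent of the replay, which is the overlap of recovery and computation that the abstract describes.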
Pages: 1-10
Page count: 10