EReinit: Scalable and efficient fault-tolerance for bulk-synchronous MPI applications

Cited by: 16
Authors
Chakraborty, Sourav [1]
Laguna, Ignacio [2]
Emani, Murali [2]
Mohror, Kathryn [2]
Panda, Dhabaleswar K. [1]
Schulz, Martin [3]
Subramoni, Hari [1]
Affiliations
[1] Ohio State Univ, Columbus, OH 43210 USA
[2] Lawrence Livermore Natl Lab, Livermore, CA 94550 USA
[3] Tech Univ Munich, Munich, Germany
Source
Keywords
fault tolerance; high-performance computing; MPI; resilience
DOI
10.1002/cpe.4863
Chinese Library Classification (CLC) number
TP31 [Computer Software]
Discipline classification code
081202; 0835
Abstract
Scientists from many different fields have been developing bulk-synchronous MPI applications to simulate and study a wide variety of scientific phenomena. Since failure rates are expected to increase in larger-scale future HPC systems, providing efficient fault-tolerance mechanisms for this class of applications is paramount. The global-restart model has been proposed to decrease failure-recovery time in bulk-synchronous applications by allowing a fast reinitialization of MPI. However, current implementations of this model have several drawbacks: they lack efficiency; their scalability has not been shown; and they require the use of the MPI profiling interface, which precludes the use of tools. In this paper, we present EReinit, an implementation of the global-restart model that addresses these problems. Our key idea and optimization is the co-design of basic fault-tolerance mechanisms, such as failure detection, notification, and recovery, between MPI and the resource manager, in contrast to current approaches in which these mechanisms are implemented in MPI only. We demonstrate EReinit in three HPC programs and show that it is up to four times more efficient than existing solutions at 4,096 processes.
Pages: 13