EReinit: Scalable and efficient fault-tolerance for bulk-synchronous MPI applications

Cited by: 16
Authors
Chakraborty, Sourav [1 ]
Laguna, Ignacio [2 ]
Emani, Murali [2 ]
Mohror, Kathryn [2 ]
Panda, Dhabaleswar K. [1 ]
Schulz, Martin [3 ]
Subramoni, Hari [1 ]
Affiliations
[1] Ohio State Univ, Columbus, OH 43210 USA
[2] Lawrence Livermore Natl Lab, Livermore, CA 94550 USA
[3] Tech Univ Munich, Munich, Germany
Source
Keywords
fault tolerance; high-performance computing; MPI; resilience;
DOI
10.1002/cpe.4863
Chinese Library Classification number
TP31 [Computer Software];
Discipline classification code
081202 ; 0835 ;
Abstract
Scientists from many different fields have been developing Bulk-Synchronous MPI applications to simulate and study a wide variety of scientific phenomena. Since failure rates are expected to increase in larger-scale future HPC systems, providing efficient fault-tolerance mechanisms for this class of applications is paramount. The global-restart model has been proposed to decrease the time of failure recovery in Bulk-Synchronous applications by allowing a fast reinitialization of MPI. However, the current implementations of this model have several drawbacks: they lack efficiency; their scalability has not been shown; and they require the use of the MPI profiling interface, which precludes the use of tools. In this paper, we present EReinit, an implementation of the global-restart model that addresses these problems. Our key idea and optimization is the co-design of basic fault-tolerance mechanisms such as failure detection, notification, and recovery between MPI and the resource manager, in contrast to current approaches in which these mechanisms are implemented in MPI only. We demonstrate EReinit in three HPC programs and show that it is up to four times more efficient than existing solutions at 4,096 processes.
Pages: 13
Related papers
50 in total
  • [31] A fault tolerance solution for sequential and MPI applications on the grid
    Computer Architecture Group, University of A Coruña, Spain
    [J]. Scalable Comput. Pract. Exp., 2008, 2 (101-109): : 101 - 109
  • [32] A Channel Memory based fault tolerance for MPI applications
    Selikhov, A
    Germain, C
    [J]. FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF ESCIENCE, 2005, 21 (05): : 709 - 715
  • [33] MigPF: Towards on self-organizing process rescheduling of Bulk-Synchronous Parallel applications
    Righi, Rodrigo da Rosa
    Gomes, Roberto de Quadros
    Rodrigues, Vinicius Facco
    da Costa, Cristiano Andre
    Alberti, Antonio Marcos
    Pilla, Laercio Lima
    Alexandre Navaux, Philippe Olivier
    [J]. FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF ESCIENCE, 2018, 78 : 272 - 286
  • [34] Resilience for Collaborative Applications on Clouds Fault-Tolerance for Distributed HPC Applications
    Toan Nguyen
    Desideri, Jean-Antoine
    [J]. COMPUTATIONAL SCIENCE AND ITS APPLICATIONS - ICCSA 2012, PT IV, 2012, 7336 : 418 - 433
  • [35] MPI jobs within MPI jobs: A practical way of enabling task-level fault-tolerance in HPC workflows
    Wozniak, Justin M.
    Dorier, Matthieu
    Ross, Robert
    Shu, Tong
    Kurc, Tahsin
    Tang, Li
    Podhorszki, Norbert
    Wolf, Matthew
    [J]. FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF ESCIENCE, 2019, 101 : 576 - 589
  • [36] PLOVER: Fast, Multi-core Scalable Virtual Machine Fault-tolerance
    Wang, Cheng
    Chen, Xusheng
    Jia, Weiwei
    Li, Boxuan
    Qiu, Haoran
    Zhao, Shixiong
    Cui, Heming
    [J]. PROCEEDINGS OF THE 15TH USENIX SYMPOSIUM ON NETWORKED SYSTEMS DESIGN AND IMPLEMENTATION (NSDI'18), 2018, : 483 - 499
  • [37] On Providing Scalable Self-healing Adaptive Fault-tolerance to RTR SoCs
    Navas, Byron
    Oberg, Johnny
    Sander, Ingo
    [J]. 2014 INTERNATIONAL CONFERENCE ON RECONFIGURABLE COMPUTING AND FPGAS (RECONFIG), 2014,
  • [38] Speculations: Providing Fault-tolerance and Improving Performance of Parallel Applications
    Tapus, Cristian
    Hickey, Jason
    [J]. PROCEEDINGS OF THE 2007 ACM SIGPLAN SYMPOSIUM ON PRINCIPLES AND PRACTICE OF PARALLEL PROGRAMMING PPOPP'07, 2007, : 152 - 153
  • [39] Applications of the fault-tolerance best-effort multicast algorithm
    Lau, Peter S.
    [J]. 2006 10th International Conference on Communication Technology, Vols 1 and 2, Proceedings, 2006, : 376 - 379
  • [40] Persistent fault-tolerance for divide-and-conquer applications on the grid
    Wrzesinska, Gosia
    Oprescu, Ana-Maria
    Kielmann, Thilo
    Bal, Henri
    [J]. EURO-PAR 2007 PARALLEL PROCESSING, PROCEEDINGS, 2007, 4641 : 425 - +