Evaluating and extending user-level fault tolerance in MPI applications

Cited by: 26
Authors
Laguna, Ignacio [1]
Richards, David F. [2]
Gamblin, Todd [1]
Schulz, Martin [1]
de Supinski, Bronis R. [3]
Mohror, Kathryn [4]
Pritchard, Howard [5]
Affiliations
[1] Lawrence Livermore Natl Lab, CASC, Livermore, CA USA
[2] Lawrence Livermore Natl Lab, Phys & Life Sci Directorate, Livermore, CA USA
[3] Lawrence Livermore Natl Lab, LC, Livermore, CA USA
[4] Lawrence Livermore Natl Lab, Ctr Appl Sci Comp, Scalabil Team, Livermore, CA USA
[5] Los Alamos Natl Lab, Los Alamos, NM USA
Keywords
MPI; fault tolerance; failure recovery models; checkpointing; molecular dynamics simulation
DOI
10.1177/1094342015623623
Chinese Library Classification
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
The user-level failure mitigation (ULFM) interface has been proposed to provide fault-tolerant semantics in the Message Passing Interface (MPI). Previous work presented performance evaluations of ULFM; yet questions related to its programmability and applicability, especially to non-trivial, bulk synchronous applications, remain unanswered. In this article, we present our experience using ULFM in a case study with a large, highly scalable, bulk synchronous molecular dynamics application to shed light on the advantages and difficulties of using this interface to program fault-tolerant MPI applications. We found that, although ULFM is suitable for master-worker applications, it provides few benefits for more common bulk synchronous MPI applications. To address these limitations, we introduce a new, simpler fault-tolerant interface for complex, bulk synchronous MPI programs with better applicability and support than ULFM for application-level recovery mechanisms, such as global rollback.
Pages: 305-319
Page count: 15