Evaluating and extending user-level fault tolerance in MPI applications

Cited by: 26
Authors
Laguna, Ignacio [1]
Richards, David F. [2]
Gamblin, Todd [1]
Schulz, Martin [1]
de Supinski, Bronis R. [3]
Mohror, Kathryn [4]
Pritchard, Howard [5]
Affiliations
[1] Lawrence Livermore Natl Lab, CASC, Livermore, CA USA
[2] Lawrence Livermore Natl Lab, Phys & Life Sci Directorate, Livermore, CA USA
[3] Lawrence Livermore Natl Lab, LC, Livermore, CA USA
[4] Lawrence Livermore Natl Lab, Ctr Appl Sci Comp, Scalabil Team, Livermore, CA USA
[5] Los Alamos Natl Lab, Los Alamos, NM USA
Keywords
MPI; fault tolerance; failure recovery models; checkpointing; molecular dynamics simulation
DOI
10.1177/1094342015623623
Chinese Library Classification number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
The user-level failure mitigation (ULFM) interface has been proposed to provide fault-tolerant semantics in the Message Passing Interface (MPI). Previous work presented performance evaluations of ULFM, yet questions about its programmability and applicability, especially to non-trivial, bulk synchronous applications, remain unanswered. In this article, we present our experience using ULFM in a case study with a large, highly scalable, bulk synchronous molecular dynamics application, to shed light on the advantages and difficulties of using this interface to program fault-tolerant MPI applications. We found that, although ULFM is suitable for master-worker applications, it provides few benefits for the more common bulk synchronous MPI applications. To address these limitations, we introduce a new, simpler fault-tolerant interface for complex, bulk synchronous MPI programs, with better applicability and support than ULFM for application-level recovery mechanisms such as global rollback.
Pages: 305-319
Number of pages: 15
Related papers
50 in total
  • [21] Fault tolerance of MPI applications in exascale systems: The ULFM solution
    Losada, Nuria
    Gonzalez, Patricia
    Martin, Maria J.
    Bosilca, George
    Bouteiller, Aurelien
    Teranishi, Keita
    FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF ESCIENCE, 2020, 106 (106): : 467 - 481
  • [22] Flexible user-level scheduling
    Craig, D
    Polychronopoulos, C
    PARALLEL AND DISTRIBUTED COMPUTING SYSTEMS, 2000, : 93 - 98
  • [23] Fault tolerance for cluster-oriented MPI parallel applications
    Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China
    Qinghua Daxue Xuebao, 2006, 1 (67-69+110):
  • [24] Extending an Application-Level Checkpointing Tool to Provide Fault Tolerance Support to OpenMP Applications
    Losada, Nuria
    Martin, Maria J.
    Rodriguez, Gabriel
    Gonzalez, Patricia
    JOURNAL OF UNIVERSAL COMPUTER SCIENCE, 2014, 20 (09) : 1352 - 1372
  • [25] Detecting and Analyzing Year 2038 Problem Bugs in User-level Applications
    Suzuki, Keita
    Kubota, Takafumi
    Kono, Kenji
    2019 IEEE 24TH PACIFIC RIM INTERNATIONAL SYMPOSIUM ON DEPENDABLE COMPUTING (PRDC 2019), 2019, : 65 - 74
  • [26] Dynamic workload balancing of parallel applications with user-level scheduling on the Grid
    Korkhov, Vladimir V.
    Moscicki, Jakub T.
    Krzhizhanovskaya, Valeria V.
    FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF ESCIENCE, 2009, 25 (01): : 28 - 34
  • [27] On the User-Level Satisfactions with User-Level Utility Functions: A Case Study with Scheduling in TDMA Wireless Networks
    Kim, Sungyeon
    Lee, Jang-Won
    IEICE TRANSACTIONS ON COMMUNICATIONS, 2010, E93B (04) : 1037 - 1040
  • [28] A user-level framework for auditing and monitoring
    Wu, YZ
    Yap, RHC
    21ST ANNUAL COMPUTER SECURITY APPLICATIONS CONFERENCE, PROCEEDINGS, 2005, : 84 - 94
  • [29] User-level checkpointing for LinuxThreads programs
    Dieter, WR
    Lumpp, JE
    USENIX ASSOCIATION PROCEEDINGS OF THE FREENIX TRACK, 2001, : 81 - 92
  • [30] A toolkit for user-level file systems
    Mazières, D
    USENIX ASSOCIATION PROCEEDINGS OF THE 2001 USENIX ANNUAL TECHNICAL CONFERENCE, 2001, : 261 - 274