Building and using a fault-tolerant MPI implementation

Cited by: 12
Authors
Fagg, GE
Dongarra, JJ
Affiliations
[1] High Performance Comp Ctr Stuttgart, D-70550 Stuttgart, Germany
[2] Univ Tennessee, Dept Comp Sci, Knoxville, TN 37996 USA
Source
INTERNATIONAL JOURNAL OF HIGH PERFORMANCE COMPUTING APPLICATIONS | 2004, Vol. 18, No. 3
Keywords
fault tolerant; message passing; parallel computing; MPI;
DOI
10.1177/1094342004046052
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology];
Subject classification code
0812;
Abstract
In this paper we discuss the design and use of a fault-tolerant MPI (FT-MPI) that handles process failures in ways that go beyond the original MPI static process model. FT-MPI allows the semantics and associated modes of failures to be explicitly controlled by an application via modified functionality within the standard MPI 1.2 API. We give an overview of the FT-MPI semantics, architecture design, example usage, and sample applications. We also briefly discuss the consequences of designing a fault-tolerant MPI, both in terms of how such an implementation handles failures at multiple levels internally and how existing applications can use new features while still remaining within the MPI standard.
Pages: 353-361
Page count: 9