A FAULT TOLERANCE SOLUTION FOR SEQUENTIAL AND MPI APPLICATIONS ON THE GRID

Cited by: 0
Authors
Rodriguez, Gabriel [1]
Pardo, Xoan C. [1]
Martin, Maria J. [1]
Gonzalez, Patricia [1]
Diaz, Daniel [1]
Affiliations
[1] Univ A Coruna, Comp Architecture Grp, La Coruna, Spain
Keywords
fault tolerance; grid computing; Globus; MPI; checkpointing
DOI
Not available
Chinese Library Classification number
TP31 [Computer software]
Discipline classification code
081202; 0835
Abstract
The Grid community has invested considerable effort in developing middleware that provides functionality such as resource discovery, resource management, job submission, and execution monitoring. As part of this effort, this paper addresses the design and implementation of CPPC-G, a service-based architecture for managing the execution of fault-tolerant applications on Grids. The CPPC (Controller/Precompiler for Portable Checkpointing) framework is used to insert checkpoint instrumentation into the code of sequential and MPI applications. The designed services are in charge of submitting and monitoring the execution of CPPC-instrumented applications, managing the checkpoint files that these applications generate, and detecting and automatically restarting failed executions.
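The detect-and-restart behaviour described in the abstract can be illustrated with a minimal, self-contained sketch. It is not the CPPC or CPPC-G API: every name here (`save_checkpoint`, `load_checkpoint`, `run_job`, `supervise`, `SimulatedFailure`) is hypothetical, the "application" is a toy summation loop, and the failure is simulated in-process rather than detected through Grid services.

```python
import json
import os


class SimulatedFailure(Exception):
    """Hypothetical stand-in for a failed execution on a Grid node."""


def save_checkpoint(path, state):
    # Write to a temporary file and rename it, so a crash during the
    # write never leaves a truncated checkpoint behind.
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def load_checkpoint(path):
    # Resume from the stored state if a checkpoint exists, else start fresh.
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "total": 0}


def run_job(path, n_steps, fail_at=None):
    # Toy application: sums 0..n_steps-1 and checkpoints after every step.
    state = load_checkpoint(path)
    for step in range(state["step"], n_steps):
        if fail_at is not None and step == fail_at:
            raise SimulatedFailure(f"failed at step {step}")
        state["total"] += step
        state["step"] = step + 1
        save_checkpoint(path, state)
    return state["total"]


def supervise(path, n_steps, fail_at=None, max_restarts=3):
    # Monitor the job; on failure, restart it from the latest checkpoint.
    restarts = 0
    while True:
        try:
            return run_job(path, n_steps, fail_at), restarts
        except SimulatedFailure:
            restarts += 1
            if restarts > max_restarts:
                raise
            fail_at = None  # assumption: the simulated fault occurs only once
```

Calling `supervise(path, 10, fail_at=5)` on a fresh checkpoint path runs steps 0 to 4, fails, and then resumes at step 5 instead of recomputing from the beginning.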
Pages: 101-109
Page count: 9
Related papers
  • [1] A fault tolerance solution for sequential and MPI applications on the grid
    Computer Architecture Group, University of A Coruña, Spain
    [J]. Scalable Comput. Pract. Exp., 2008, 2 (101-109): : 101 - 109
  • [2] Fault tolerance of MPI applications in exascale systems: The ULFM solution
    Losada, Nuria
    Gonzalez, Patricia
    Martin, Maria J.
    Bosilca, George
    Bouteiller, Aurelien
    Teranishi, Keita
    [J]. FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF ESCIENCE, 2020, 106 (106): : 467 - 481
  • [3] Replication-Based Fault Tolerance for MPI Applications
    Walters, John Paul
    Chaudhary, Vipin
    [J]. IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2009, 20 (07) : 997 - 1010
  • [4] A Channel Memory based fault tolerance for MPI applications
    Selikhov, A
    Germain, C
    [J]. FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF ESCIENCE, 2005, 21 (05): : 709 - 715
  • [5] Transparent fault tolerance for grid applications
    Garbacki, P
    Biskupski, B
    Bal, H
    [J]. ADVANCES IN GRID COMPUTING - EGC 2005, 2005, 3470 : 671 - 680
  • [6] Scheduling in grid: Rescheduling MPI applications using a fault-tolerant MPI implementation
    Reddy, M. Vivekananda
    Chaudhary, Sanjay
    [J]. 2007 2ND INTERNATIONAL CONFERENCE ON COMMUNICATION SYSTEMS SOFTWARE & MIDDLEWARE, VOLS 1 AND 2, 2007, : 706 - +
  • [7] Proactive fault tolerance in MPI applications via task migration
    Chakravorty, Sayantan
    Mendes, Celso L.
    Kale, Laxmikant V.
    [J]. HIGH PERFORMANCE COMPUTING - HIPC 2006, PROCEEDINGS, 2006, 4297 : 485 - +
  • [8] Fault tolerance for cluster-oriented MPI parallel applications
    Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China
    [J]. Qinghua Daxue Xuebao, 2006, 1 (67-69+110):
  • [9] Migol: A fault-tolerant service framework for MPI applications in the grid
    Luckow, Andre
    Schnor, Bettina
    [J]. FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF ESCIENCE, 2008, 24 (02): : 142 - 152
  • [10] Migol: A fault-tolerant service framework for MPI applications in the grid
    Luckow, A
    Schnor, B
    [J]. RECENT ADVANCES IN PARALLEL VIRTUAL MACHINE AND MESSAGE PASSING INTERFACE, PROCEEDINGS, 2005, 3666 : 258 - 267