Fault tolerant cluster computing through replication

Author
Shum, KH [1]
Affiliation
[1] Natl Univ Singapore, Dept Informat Syst & Comp Sci, Singapore 119260, Singapore
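Keywords
DOI
Not available
Chinese Library Classification
TP3 [Computing technology, computer technology];
Discipline classification code
0812;
Abstract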
Long-lived parallel applications running on workstation clusters are vulnerable to single-node or multiple-node failures. Fault recovery is therefore required to prevent premature program termination. However, much of the runtime overhead imposed by fault tolerance schemes is generally due to the cost of transferring the checkpoint states of applications by disk I/O operations. In this paper, we propose a fault tolerant model in which checkpoint states are transferred between replicated parallel applications. We also describe how the resource consumption of the replicated applications can be minimized. The fault tolerant model has been implemented and tested on a workstation cluster and a Fujitsu AP3000 multi-processor machine. The measurements from our experiments have shown that efficient fault tolerance can be achieved by replicating parallel applications on clusters of computers.
Pages: 756-761
Number of pages: 6