Fault tolerant cluster computing through replication

Times cited: 0
Authors
Shum, KH [1]
Affiliations
[1] Natl Univ Singapore, Dept Informat Syst & Comp Sci, Singapore 119260, Singapore
Keywords
DOI
Not available
Chinese Library Classification number
TP3 [Computing technology, computer technology];
Discipline classification code
0812;
Abstract
Long-lived parallel applications running on workstation clusters are vulnerable to single-node or multiple-node failures. Fault recovery is therefore required to prevent premature program termination. However, much of the runtime overhead imposed by fault tolerance schemes is generally due to the cost of transferring the checkpoint states of applications through disk I/O operations. In this paper, we propose a fault tolerant model in which checkpoint states are transferred between replicated parallel applications. We also describe how the resource consumption of the replicated applications can be minimized. The fault tolerant model has been implemented and tested on a workstation cluster and a Fujitsu AP3000 multi-processor machine. Our experimental measurements show that efficient fault tolerance can be achieved by replicating parallel applications on clusters of computers.
Pages: 756-761
Number of pages: 6