Fault tolerant cluster computing through replication

Author
Shum, KH [1]
Affiliation
[1] National University of Singapore, Department of Information Systems & Computer Science, Singapore 119260, Singapore
DOI
Not available
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology]
Subject classification code
0812
Abstract
Long-lived parallel applications running on workstation clusters are vulnerable to single-node or multiple-node failures. Fault recovery is therefore required to prevent premature program termination. However, much of the runtime overhead imposed by fault tolerance schemes generally comes from transferring the applications' checkpoint states through disk I/O operations. In this paper, we propose a fault-tolerant model in which checkpoint states are transferred between replicated parallel applications. We also describe how the resource consumption of the replicated applications can be minimized. The fault-tolerant model has been implemented and tested on a workstation cluster and a Fujitsu AP3000 multiprocessor machine. Our experimental measurements show that efficient fault tolerance can be achieved by replicating parallel applications on a cluster of computers.
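The core idea in the abstract, shipping checkpoint states to a replica instead of writing them to disk, can be illustrated with a minimal single-process sketch. This is not the paper's implementation: the names `AppState`, `compute_step`, `run_primary`, `run_replica` and `run_with_replication` are hypothetical, an in-memory `queue.Queue` stands in for the network link between replicas, the workload is a toy iterative sum, and the replica is run sequentially after the failure (rather than concurrently on another node) so the behaviour is deterministic. The resource-minimization part of the model is not represented.

```python
import pickle
from dataclasses import dataclass
from queue import Empty, Queue


@dataclass
class AppState:
    """Hypothetical state of a long-running iterative job."""
    step: int = 0
    total: int = 0


def compute_step(state: AppState) -> None:
    # One unit of work: accumulate the square of the current step index.
    state.total += state.step * state.step
    state.step += 1


class NodeFailure(Exception):
    """Raised to simulate a crash of the node running the primary."""


def run_primary(n_steps: int, interval: int, channel: Queue, fail_at=None) -> AppState:
    state = AppState()
    while state.step < n_steps:
        if fail_at is not None and state.step == fail_at:
            raise NodeFailure(f"primary failed at step {state.step}")
        compute_step(state)
        if state.step % interval == 0:
            # Send a serialized snapshot to the replica instead of doing disk I/O.
            channel.put(pickle.dumps(state))
    return state


def run_replica(n_steps: int, channel: Queue):
    # Drain the channel and keep only the most recent checkpoint received.
    latest = None
    while True:
        try:
            latest = channel.get_nowait()
        except Empty:
            break
    # With no checkpoint received, the replica has to start from scratch.
    state = pickle.loads(latest) if latest is not None else AppState()
    resumed_from = state.step
    while state.step < n_steps:
        compute_step(state)
    return state, resumed_from


def run_with_replication(n_steps: int, interval: int, fail_at=None):
    """Run the primary; on failure, finish the job on the replica.

    Returns (final_state, step_the_replica_resumed_from or None).
    """
    channel: Queue = Queue()
    try:
        return run_primary(n_steps, interval, channel, fail_at), None
    except NodeFailure:
        return run_replica(n_steps, channel)
```

For example, `run_with_replication(100, 10, fail_at=57)` loses only the work done after the last checkpoint at step 50: the replica resumes from step 50 and its final `total` matches a failure-free `run_with_replication(100, 10)`. The checkpoint interval trades transfer overhead against the amount of work recomputed after a failure.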
Pages: 756-761
Page count: 6