Error detection and diagnosis for fault tolerance in distributed systems

Cited by: 5
Authors
Saleh, K [1]
Al-Saqabi, K [1]
Affiliation
[1] Kuwait Univ, Dept Elect & Comp Engn, Safat 13060, Kuwait
Keywords
communications software; detection diagnosis; distributed systems; fault tolerance
DOI
10.1016/S0950-5849(97)00058-X
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Early error detection, together with an understanding of the nature and conditions of an error's occurrence, can help make recovery in distributed systems effective and efficient. Various distributed system extensions have been introduced to implement fault tolerance in distributed software systems. These extensions rely mainly on the exchange of contextual information appended to every transmitted application-specific message. Ideally, this information should be used for checkpointing, error detection, diagnosis and recovery should a transient failure occur later during the distributed program execution. In this paper, we present a generalized extension suitable for fault-tolerant distributed systems such as communication software systems, and we show its detection capabilities. Our extension is based on executing a message validity test prior to the transmission of messages and on piggybacking contextual information to facilitate the detection and diagnosis of transient faults in the distributed system. (C) 1998 Elsevier Science B.V.
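The two mechanisms named in the abstract, a validity test run before each send and contextual information piggybacked on every message, can be illustrated with a minimal Python sketch. This is not the authors' extension: the `Envelope` fields, the `is_valid` rule, the message types, and the sequence-number check are all hypothetical choices made only to show the general idea.

```python
from dataclasses import dataclass, field


class ValidityError(Exception):
    """Raised when a message fails the pre-transmission validity test."""


@dataclass(frozen=True)
class Envelope:
    sender: str
    seq: int            # piggybacked context: per-sender sequence number (assumed)
    checkpoint_id: int  # piggybacked context: sender's latest checkpoint (assumed)
    payload: dict       # the application-specific message


@dataclass
class Process:
    name: str
    checkpoint_id: int = 0
    next_seq: int = 0
    last_seen: dict = field(default_factory=dict)  # sender -> last in-order seq

    def is_valid(self, payload: dict) -> bool:
        # Hypothetical validity test: the payload must carry a known type.
        return payload.get("type") in {"REQ", "ACK", "DATA"}

    def send(self, payload: dict) -> Envelope:
        # Run the validity test before the message leaves the process.
        if not self.is_valid(payload):
            raise ValidityError(f"{self.name}: invalid payload {payload!r}")
        env = Envelope(self.name, self.next_seq, self.checkpoint_id, payload)
        self.next_seq += 1
        return env

    def receive(self, env: Envelope) -> str:
        # Use the piggybacked sequence number to detect and diagnose faults.
        expected = self.last_seen.get(env.sender, -1) + 1
        if env.seq == expected:
            self.last_seen[env.sender] = env.seq
            return "ok"
        if env.seq > expected:
            return f"gap: missing {expected}..{env.seq - 1} from {env.sender}"
        return f"duplicate: {env.seq} from {env.sender}"
```

In this sketch a receiver that gets sequence number 2 directly after 0 reports a gap naming the missing message, which is the kind of diagnosis the piggybacked context is meant to enable. A real system would also use the checkpoint identifier to decide where to roll back, which is left out here.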
Pages: 975-983
Page count: 9
Related papers
50 in total
  • [41] Fault Tolerance in Distributed Systems Using Fused Data Structures
    Balasubramanian, Bharath
    Garg, Vijay K.
    IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2013, 24 (04) : 701 - 715
  • [42] AN EFFICIENT RECOVERY PROCEDURE FOR FAULT-TOLERANCE IN DISTRIBUTED SYSTEMS
    SALEH, K
    AHMAD, I
    ALSAQABI, K
    AGARWAL, A
    JOURNAL OF SYSTEMS AND SOFTWARE, 1994, 25 (01) : 39 - 50
  • [43] ON FAULT-TOLERANCE MECHANISMS IN DISTRIBUTED COMPUTER SYSTEMS.
    Eberbach, Eugeniusz
    Just, Jan R.
    1600, (16): : 4 - 5
  • [44] ON FAULT-TOLERANCE MECHANISMS IN DISTRIBUTED COMPUTER-SYSTEMS
    EBERBACH, E
    JUST, JR
    MICROPROCESSING AND MICROPROGRAMMING, 1985, 16 (4-5): : 239 - 244
  • [45] Fault tolerance in distributed systems using fused state machines
    Bharath Balasubramanian
    Vijay K. Garg
    Distributed Computing, 2014, 27 : 287 - 311
  • [46] Fault tolerance in distributed systems using fused state machines
    Balasubramanian, Bharath
    Garg, Vijay K.
    DISTRIBUTED COMPUTING, 2014, 27 (04) : 287 - 311
  • [47] Fault tolerance through automated diversity in the management of distributed systems
    Preissinger, Joerg
    IMECS 2008: INTERNATIONAL MULTICONFERENCE OF ENGINEERS AND COMPUTER SCIENTISTS, VOLS I AND II, 2008, : 933 - 939
  • [48] A new algorithm for increasing fault-tolerance of distributed systems
    Dishabi, Mohammad Reza Ebrahimi
    Sharifi, Mohsen
    PROCEEDINGS OF THE SIXTH IASTED INTERNATIONAL CONFERENCE ON COMMUNICATION SYSTEMS AND NETWORKS, 2007, : 96 - +
  • [49] Application of Regenerating Codes for Fault Tolerance in Distributed Storage Systems
    Peter, Kathrin
    Sobe, Peter
    2012 11TH IEEE INTERNATIONAL SYMPOSIUM ON NETWORK COMPUTING AND APPLICATIONS (NCA), 2012, : 67 - 70
  • [50] Availability, resilience, and fault tolerance of internet and distributed computing systems
    Xiang, Yang
    Pathan, Mukaddim
    Wei, Guiyi
    Fortino, Giancarlo
    CONCURRENCY AND COMPUTATION-PRACTICE & EXPERIENCE, 2015, 27 (10): : 2503 - 2505