Error detection and diagnosis for fault tolerance in distributed systems

Cited by: 5
Authors
Saleh, K [1 ]
Al-Saqabi, K [1 ]
Affiliation
[1] Kuwait Univ, Dept Elect & Comp Engn, Safat 13060, Kuwait
Keywords
communications software; detection diagnosis; distributed systems; fault tolerance
DOI
10.1016/S0950-5849(97)00058-X
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Early error detection, together with an understanding of the nature and conditions of an error's occurrence, supports effective and efficient recovery in distributed systems. Various distributed system extensions have been introduced to implement fault tolerance in distributed software systems. These extensions rely mainly on the exchange of contextual information appended to every transmitted application-specific message. Ideally, this information is used for checkpointing, error detection, diagnosis, and recovery should a transient failure occur later during the distributed program's execution. In this paper, we present a generalized extension suitable for fault-tolerant distributed systems such as communication software systems, and we show its detection capabilities. Our extension is based on executing a message validity test before messages are transmitted and on piggybacking contextual information to facilitate the detection and diagnosis of transient faults in the distributed system. (C) 1998 Elsevier Science B.V.
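The two mechanisms named in the abstract, a validity test run before a message is sent and contextual information piggybacked on each message, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the toy protocol table `TRANSITIONS`, the `Context` fields (sender name, sequence number, sender state), and the diagnosis strings are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Hypothetical protocol specification: (state, message) -> next state.
TRANSITIONS = {
    ("idle", "CONNECT"): "connected",
    ("connected", "DATA"): "connected",
    ("connected", "DISCONNECT"): "idle",
}


@dataclass(frozen=True)
class Context:
    """Contextual information piggybacked on every message (assumed fields)."""
    sender: str
    seq: int      # per-sender send counter
    state: str    # sender's protocol state at send time


@dataclass(frozen=True)
class Envelope:
    payload: str
    context: Context


class Process:
    def __init__(self, name: str, state: str = "idle") -> None:
        self.name = name
        self.state = state
        self.seq = 0
        self.expected: Dict[str, int] = {}  # sender -> next expected seq

    def can_send(self, payload: str) -> bool:
        """Message validity test: is this message legal in the current state?"""
        return (self.state, payload) in TRANSITIONS

    def send(self, payload: str) -> Envelope:
        # Run the validity test before transmission, so an invalid message
        # is caught at the sender instead of propagating.
        if not self.can_send(payload):
            raise ValueError(
                f"{self.name}: {payload!r} is not valid in state {self.state!r}"
            )
        env = Envelope(payload, Context(self.name, self.seq, self.state))
        self.seq += 1
        self.state = TRANSITIONS[(self.state, payload)]
        return env

    def receive(self, env: Envelope) -> Optional[str]:
        """Return a diagnosis string if an error is detected, else None."""
        ctx = env.context
        want = self.expected.get(ctx.sender, 0)
        if ctx.seq > want:
            return (f"lost message(s) from {ctx.sender}: "
                    f"expected seq {want}, got {ctx.seq}")
        if ctx.seq < want:
            return f"duplicate message from {ctx.sender}: seq {ctx.seq}"
        self.expected[ctx.sender] = want + 1
        return None


if __name__ == "__main__":
    a, b = Process("A"), Process("B")
    m0, m1, m2 = a.send("CONNECT"), a.send("DATA"), a.send("DATA")
    print(b.receive(m0))  # delivered in order, no error detected
    print(b.receive(m2))  # m1 was dropped in transit, so a gap is reported
```

In this sketch the sequence number only lets the receiver detect a gap or a duplicate. The piggybacked sender state is carried along but not checked. A fuller scheme could compare it against the receiver's view of the protocol to narrow down where a transient fault occurred.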
Pages: 975-983
Page count: 9
Related papers
50 in total
  • [11] Fault detection and diagnosis in distributed systems: An approach by partially stochastic Petri nets
    Aghasaryan, A
    Fabre, E
    Benveniste, A
    Boubour, R
    Jard, C
    DISCRETE EVENT DYNAMIC SYSTEMS-THEORY AND APPLICATIONS, 1998, 8 (02): 203-231
  • [12] Fault Detection and Diagnosis in Distributed Systems: An Approach by Partially Stochastic Petri Nets
    Armen Aghasaryan
    Eric Fabre
    Albert Benveniste
    Renée Boubour
    Claude Jard
    Discrete Event Dynamic Systems, 1998, 8: 203-231
  • [13] Survivability of distributed fault detection systems
    Zhou L.
    Lv H.
    Liu K.
    Zhang J.
    International Journal of Performability Engineering, 2019, 15 (11): 3008-3015
  • [14] Fault Diagnosis and Fault Tolerance of Drive Systems: Status and Research
    Muenchhof, Marco
    Beck, Mark
    Isermann, Rolf
    EUROPEAN JOURNAL OF CONTROL, 2009, 15 (3-4): 370-388
  • [15] The customizable fault/error model for dependable distributed systems
    Walter, CJ
    Suri, N
    THEORETICAL COMPUTER SCIENCE, 2003, 290 (02): 1223-1251
  • [16] A Fault-tolerance Framework for Distributed Component Systems
    Hamid, Brahim
    Radermacher, Ansgar
    Vanuxeem, Patrick
    Lanusse, Agnes
    Gerard, Sebastien
    PROCEEDINGS OF THE 34TH EUROMICRO CONFERENCE ON SOFTWARE ENGINEERING AND ADVANCED APPLICATIONS, 2008: 84-91
  • [17] A framework for fault tolerance in distributed real time systems
    Malik, S
    Rehman, MJ
    IEEE: 2005 International Conference on Emerging Technologies, Proceedings, 2005: 505-510
  • [19] FLEXIBLE FAULT TOLERANCE FOR DISTRIBUTED COMPUTER-SYSTEMS
    LOQUES, OG
    KRAMER, J
    IEE PROCEEDINGS-E COMPUTERS AND DIGITAL TECHNIQUES, 1986, 133 (06): 319-332
  • [20] Fused State Machines for Fault Tolerance in Distributed Systems
    Balasubramanian, Bharath
    Garg, Vijay K.
    PRINCIPLES OF DISTRIBUTED SYSTEMS, 2011, 7109: 266-282