Adaptive diagnosis in distributed systems

Cited by: 90
Authors
Rish, I [1]
Brodie, M
Ma, S
Odintsova, N
Beygelzimer, A
Grabarnik, G
Hernandez, K
Affiliations
[1] IBM Corp, TJ Watson Res Ctr, Hawthorne, NY 10532 USA
[2] IBM Syst & Technol Grp, Austin, TX 78758 USA
Source
IEEE TRANSACTIONS ON NEURAL NETWORKS | 2005 / Vol. 16 / No. 05
Keywords
Bayesian networks (BNs); computer networks; diagnosis; distributed systems; end-to-end transactions; information gain; probabilistic inference
DOI
10.1109/TNN.2005.853423
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Real-time problem diagnosis in large distributed computer systems and networks is a challenging task that requires fast and accurate inferences from potentially huge data volumes. In this paper, we propose a cost-efficient, adaptive diagnostic technique called active probing. Probes are end-to-end test transactions that collect information about the performance of a distributed system. Active probing combines probabilistic reasoning techniques with an information-theoretic approach, and allows fast online inference about the current system state via active selection of only a small number of the most informative tests. We demonstrate empirically that, compared with nonadaptive (preplanned) probing schemes, the active probing scheme greatly reduces both the number of probes (by 60% to 75% in most of our real-life applications) and the time needed to localize the problem. We also provide some theoretical results on the complexity of probe selection and on the effect of "noisy" probes on the accuracy of diagnosis. Finally, we discuss how to model the system's dynamics using dynamic Bayesian networks (DBNs), along with an efficient approximate approach called sequential multifault; empirical results demonstrate a clear advantage of such approaches over "static" techniques that do not handle changes in the system.
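The idea of selecting the most informative probe at each step can be illustrated with a minimal Python sketch. It is not the paper's implementation: it assumes at most one faulty node, noise-free probes (a probe fails exactly when the faulty node lies on its path), and a uniform prior, whereas the abstract describes Bayesian-network inference that also covers noisy probes and multiple faults. The function names, the probe names `p1` to `p4`, and the node names are hypothetical.

```python
import math

def entropy(belief):
    """Shannon entropy (bits) of a dict mapping state -> probability."""
    return -sum(p * math.log2(p) for p in belief.values() if p > 0)

def condition(belief, probe_nodes, failed):
    """Posterior over states after observing a noise-free probe outcome.

    Assumption: a probe fails exactly when the faulty node is on its path.
    """
    kept = {s: p for s, p in belief.items() if (s in probe_nodes) == failed}
    total = sum(kept.values())
    return {s: p / total for s, p in kept.items()} if total > 0 else {}

def information_gain(belief, probe_nodes):
    """Expected reduction in entropy of the belief from running one probe."""
    p_fail = sum(p for s, p in belief.items() if s in probe_nodes)
    p_ok = sum(p for s, p in belief.items() if s not in probe_nodes)
    gain = entropy(belief)
    for failed, p_outcome in ((True, p_fail), (False, p_ok)):
        if p_outcome > 0:
            gain -= p_outcome * entropy(condition(belief, probe_nodes, failed))
    return gain

def diagnose(probes, belief, run_probe):
    """Greedily run the most informative probe until one state remains."""
    used = []
    while len(belief) > 1:
        best = max(probes, key=lambda name: information_gain(belief, probes[name]))
        if information_gain(belief, probes[best]) <= 1e-12:
            break  # no remaining probe can separate the candidate states
        belief = condition(belief, probes[best], run_probe(best))
        used.append(best)
    return belief, used

# Hypothetical system: four nodes, four probes, each covering a set of nodes.
probes = {
    "p1": {"A", "B"},
    "p2": {"B", "C"},
    "p3": {"C", "D"},
    "p4": {"A"},
}
# Candidate states: exactly one faulty node, or "OK" (no fault).
states = ["A", "B", "C", "D", "OK"]
prior = {s: 1.0 / len(states) for s in states}

# Simulate a fault at node C: a probe fails if C is on its path.
belief, used = diagnose(probes, prior, lambda name: "C" in probes[name])
print("diagnosis:", list(belief), "probes used:", used)
```

In this sketch the adaptive loop stops as soon as the belief collapses to a single state, so it typically runs fewer probes than a preplanned scheme that always executes the full probe set.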
Pages: 1088 - 1109
Page count: 22
Related papers
50 in total
  • [1] Leader Based Adaptive Fault Diagnosis Algorithm for Distributed Systems
    Manghwani, Juhi
    Taware, Rutuja
    Kelkar, Supriya
    Chinde, Priyanka
    Alwani, Saloni
    [J]. 2017 IEEE INTERNATIONAL CONFERENCE ON INFORMATION, COMMUNICATION, INSTRUMENTATION AND CONTROL (ICICIC), 2017,
  • [2] Coordinator-based Adaptive Fault Diagnosis Algorithm for Distributed Computing Systems
    Kelkar, Supriya
    Yeole, Deepali G.
    Sinkar, Mayuri B.
    Jagtap, Priyanka B.
    Zagade, Damini S.
    [J]. 2017 INTERNATIONAL CONFERENCE ON ADVANCES IN COMPUTING, COMMUNICATIONS AND INFORMATICS (ICACCI), 2017, : 745 - 751
  • [3] On Adaptive Distributed Storage Systems
    Rai, B. K.
    Dhoorjati, V.
    Saini, L.
    Jha, A. K.
    [J]. 2015 IEEE INTERNATIONAL SYMPOSIUM ON INFORMATION THEORY (ISIT), 2015, : 1482 - 1486
  • [4] Distributed chronicle for the fault diagnosis in distributed systems
    Aguilar, Jose
    Vizcarrondo, Juan
    [J]. INTERNATIONAL JOURNAL OF COMMUNICATION NETWORKS AND DISTRIBUTED SYSTEMS, 2020, 24 (03) : 284 - 315
  • [5] Distributed chronicle for the fault diagnosis in distributed systems
    Aguilar, Jose
    Vizcarrondo, Juan
    [J]. International Journal of Communication Networks and Distributed Systems, 2020, 24 (03) : 284 - 315
  • [6] Adaptive Distributed Fault Diagnosis Design for Large-scale Networks of Nonlinear Systems
    Rahme, Sandy
    Meskin, Nader
    [J]. 2017 IEEE 56TH ANNUAL CONFERENCE ON DECISION AND CONTROL (CDC), 2017,
  • [7] An architecture for distributed diagnosis systems
    Senior, C
    Dore, A
    Laengle, T
    Albert, M
    [J]. MULTI-AGENT-SYSTEMS IN PRODUCTION, 2000, : 219 - 224
  • [8] Distributed diagnosis for qualitative systems
    Su, R
    Wonham, WM
    Kurien, J
    Koutsoukos, X
    [J]. WODES'02: SIXTH INTERNATIONAL WORKSHOP ON DISCRETE EVENT SYSTEMS, PROCEEDINGS, 2002, : 169 - 174
  • [9] An Adaptive Protection Scheme for Distributed Systems with Distributed Generation
    Ma, Jing
    Mi, Chao
    Wang, Tong
    Wu, Jie
    Wang, Zengping
    [J]. 2011 IEEE POWER AND ENERGY SOCIETY GENERAL MEETING, 2011,
  • [10] Constructing adaptive software in distributed systems
    Chen, WK
    Hiltunen, MA
    Schlichting, RD
    [J]. 21ST INTERNATIONAL CONFERENCE ON DISTRIBUTED COMPUTING SYSTEMS, PROCEEDINGS, 2001, : 635 - 643