Adaptive diagnosis in distributed systems

Cited by: 90
Authors
Rish, I [1]
Brodie, M
Ma, S
Odintsova, N
Beygelzimer, A
Grabarnik, G
Hernandez, K
Affiliations
[1] IBM Corp, TJ Watson Res Ctr, Hawthorne, NY 10532 USA
[2] IBM Syst & Technol Grp, Austin, TX 78758 USA
Source
IEEE TRANSACTIONS ON NEURAL NETWORKS | 2005 / Vol. 16 / No. 05
Keywords
Bayesian networks (BNs); computer networks; diagnosis; distributed systems; end-to-end transactions; information gain; probabilistic inference;
DOI
10.1109/TNN.2005.853423
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Real-time problem diagnosis in large distributed computer systems and networks is a challenging task that requires fast and accurate inferences from potentially huge data volumes. In this paper, we propose a cost-efficient, adaptive diagnostic technique called active probing. Probes are end-to-end test transactions that collect information about the performance of a distributed system. Active probing combines probabilistic reasoning techniques with an information-theoretic approach, and allows fast online inference about the current system state via active selection of only a small number of the most informative tests. We demonstrate empirically that the active probing scheme greatly reduces both the number of probes (by 60% to 75% in most of our real-life applications) and the time needed for localizing the problem when compared with nonadaptive (preplanned) probing schemes. We also provide some theoretical results on the complexity of probe selection and on the effect of "noisy" probes on the accuracy of diagnosis. Finally, we discuss how to model the system's dynamics using dynamic Bayesian networks (DBNs), and an efficient approximate approach called sequential multifault; empirical results demonstrate a clear advantage of such approaches over "static" techniques that do not handle changes in the system.
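The idea of actively selecting the most informative probe can be illustrated with a minimal Python sketch. It is not the paper's implementation: it assumes exactly one faulty node, noise-free probes that fail if and only if the faulty node lies on their path, and a plain probability table in place of a Bayesian network. All names (entropy, expected_gain, update, diagnose, run_probe) are hypothetical and chosen only for this illustration.

```python
import math


def entropy(belief):
    """Shannon entropy (in bits) of a distribution {node: probability}."""
    return -sum(p * math.log2(p) for p in belief.values() if p > 0)


def update(belief, probe_path, failed):
    """Condition the belief on a probe outcome (assumption: noise-free probes).

    A failed probe keeps only the nodes on its path; a successful probe
    keeps only the nodes off its path. The outcome must be consistent
    with the current belief, otherwise the normalizer would be zero.
    """
    post = {n: p for n, p in belief.items() if (n in probe_path) == failed}
    total = sum(post.values())
    return {n: p / total for n, p in post.items()}


def expected_gain(belief, probe_path):
    """Expected reduction in entropy from sending one probe."""
    p_fail = sum(p for n, p in belief.items() if n in probe_path)
    gain = entropy(belief)
    for p_outcome, failed in ((p_fail, True), (1.0 - p_fail, False)):
        if p_outcome > 0:
            gain -= p_outcome * entropy(update(belief, probe_path, failed))
    return gain


def diagnose(belief, probes, run_probe):
    """Greedily send the probe with the highest expected information gain.

    probes maps a probe name to the set of nodes it traverses;
    run_probe(name) returns True if that probe failed.
    """
    sent = []
    while len(belief) > 1:
        best = max(probes, key=lambda name: expected_gain(belief, probes[name]))
        if expected_gain(belief, probes[best]) <= 1e-12:
            break  # no remaining probe can distinguish the candidates
        belief = update(belief, probes[best], run_probe(best))
        sent.append(best)
    return belief, sent
```

With a uniform prior over four nodes and three candidate probes, the loop stops as soon as the remaining probes carry no further information, so it may send fewer probes than a preplanned scheme that always sends all of them. Noisy probes and multiple simultaneous faults, which the abstract also mentions, are outside the scope of this sketch.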
Pages: 1088 - 1109
Number of pages: 22