An automated fault detection system for communication networks and distributed systems

Cited by: 2
Authors
Van Nguyen, Sinh [1]
Tran, Ha Manh [1, 2]
Affiliations
[1] Int Univ VNUHCM, Sch Comp Sci & Engn, Ho Chi Minh City, Vietnam
[2] Hong Bang Int Univ, Fac Informat Technol, Ho Chi Minh City, Vietnam
Keywords
Fault detection; Automation; Machine learning; Random forest; Bug tracking system
DOI
10.1007/s10489-020-02026-2
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Automating fault detection in communication networks and distributed systems is a challenging process that usually requires supporting tools and the expertise of system operators. Automated event monitoring and correlation systems produce event data that is forwarded to system operators, who analyze error events and create fault reports. Machine learning methods help not only to analyze event data more precisely but also to forecast possible error events by learning from existing faults. This study introduces an automated fault detection system that assists system operators in detecting and forecasting faults. The system exploits bug knowledge resources from various online repositories, together with log events and status parameters from the monitored system, and applies bug analysis and event filtering methods to evaluate events and forecast faults. The system contains a fault data model to collect bug reports, a feature and semantic filtering method to correlate log events, and machine learning methods to evaluate the severity, priority and relation of log events and to forecast forthcoming critical faults of the monitored system. We have evaluated a prototype implementation of the proposed system on a high-performance computing cluster and provide an analysis with lessons learned.
Pages: 5405-5419
Number of pages: 15
Related papers
50 in total
  • [1] An automated fault detection system for communication networks and distributed systems
    Sinh Van Nguyen
    Ha Manh Tran
    [J]. Applied Intelligence, 2021, 51 : 5405 - 5419
  • [2] Residue Number System for Fault Detection in Communication Networks
    Singh, Tanu
    [J]. 2014 INTERNATIONAL CONFERENCE ON MEDICAL IMAGING, M-HEALTH & EMERGING COMMUNICATION SYSTEMS (MEDCOM), 2015, : 157 - 161
  • [3] Fault Tolerance Communication in Mobile Distributed Networks
    Suganth, D. Bhuvana
    Manjunath, R.
    [J]. PROCEEDINGS OF THE INTERNATIONAL CONFERENCE ON DATA ENGINEERING AND COMMUNICATION TECHNOLOGY, ICDECT 2016, VOL 1, 2017, 468 : 77 - 87
  • [4] The Communication in Intelligent Distributed Fault Tolerant Systems
    Garza, Arnulfo Alanis
    Serrano, Juan Jose
    Carot, Rafael Ors
    Garcia Valdez, Jose Mario
    [J]. ENGINEERING LETTERS, 2006, 13 (02)
  • [5] Communication fault tolerance in distributed robotic systems
    Molnár, P
    Starke, J
    [J]. DISTRIBUTED AUTONOMOUS ROBOTIC SYSTEMS, 2000, : 99 - 108
  • [6] Automated fault detection and classification of etch systems using modular neural networks
    Hong, SJ
    May, GS
    Yamartino, J
    Skumanich, A
    [J]. DATA ANALYSIS AND MODELING FOR PROCESS CONTROL, 2004, 5378 : 134 - 141
  • [7] Survivability of distributed fault detection systems
    Zhou L.
    Lv H.
    Liu K.
    Zhang J.
    [J]. International Journal of Performability Engineering, 2019, 15 (11) : 3008 - 3015
  • [8] Fault Detection and Localization in Distributed Systems Using Recurrent Convolutional Neural Networks
    Qi, Guangyang
    Yao, Lina
    Uzunov, Anton V.
    [J]. ADVANCED DATA MINING AND APPLICATIONS, ADMA 2017, 2017, 10604 : 33 - 48
  • [9] A unified theory of fault diagnosis and distributed fault management in communication networks
    Berthet, GG
    Fischer, N
    [J]. AFRICON '96 - 1996 IEEE AFRICON : 4TH AFRICON CONFERENCE IN AFRICA, VOLS I & II: ELECTRICAL ENERGY TECHNOLOGY; COMMUNICATION SYSTEMS; HUMAN RESOURCES, 1996, : 776 - 781
  • [10] Distributed sensor system for fault detection and isolation in multistage manufacturing systems
    Du Shi-Chang
    Xi Li-Feng
    Shi Jian-Jun
    [J]. INTERNATIONAL JOURNAL OF COMPUTER APPLICATIONS IN TECHNOLOGY, 2006, 25 (04) : 182 - 191