MicroHECL: High-Efficient Root Cause Localization in Large-Scale Microservice Systems

Cited by: 59
Authors
Liu, Dewei [1]
He, Chuan [1]
Peng, Xin [1]
Lin, Fan [2]
Zhang, Chenxi [1]
Gong, Shengfang [2]
Li, Ziang [2]
Ou, Jiayu [2]
Wu, Zheshun [2]
Affiliations
[1] Fudan Univ, Shanghai, Peoples R China
[2] Alibaba Grp, Hangzhou, Peoples R China
Keywords
microservice; availability; root cause localization; anomaly detection; service call graph; GRAPH;
DOI
10.1109/ICSE-SEIP52600.2021.00043
Chinese Library Classification number
TP31 [Computer Software];
Discipline classification code
081202; 0835;
Abstract
Availability issues of industrial microservice systems (e.g., drop of successfully placed orders and processed transactions) directly affect the running of the business. These issues are usually caused by various types of service anomalies which propagate along service dependencies. Accurate and high-efficient root cause localization is thus a critical challenge for large-scale industrial microservice systems. Existing approaches use service dependency graph based analysis techniques to automatically locate root causes. However, these approaches are limited due to their inaccurate detection of service anomalies and inefficient traversing of service dependency graph. In this paper, we propose a high-efficient root cause localization approach for availability issues of microservice systems, called MicroHECL. Based on a dynamically constructed service call graph, MicroHECL analyzes possible anomaly propagation chains, and ranks candidate root causes based on correlation analysis. We combine machine learning and statistical methods and design customized models for the detection of different types of service anomalies (i.e., performance, reliability, traffic). To improve the efficiency, we adopt a pruning strategy to eliminate irrelevant service calls in anomaly propagation chain analysis. Experimental studies show that MicroHECL significantly outperforms two state-of-the-art baseline approaches in terms of both accuracy and efficiency. MicroHECL has been used in Alibaba and achieves a top-3 hit ratio of 68% with root cause localization time reduced from 30 minutes to 5 minutes.
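The workflow outlined in the abstract (walk the service call graph along anomalous calls, prune irrelevant calls, then rank candidate root causes by correlation) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `pearson` and `localize`, the `prune_below` threshold, the single caller-to-callee traversal direction, the use of Pearson correlation for both pruning and ranking, and all sample data are assumptions made for illustration. Anomaly detection itself is assumed to have already produced the set of anomalous calls.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length series (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy) if vx > 0 and vy > 0 else 0.0

def localize(calls, anomalous, metrics, business, entry, prune_below=0.7):
    """Hypothetical sketch: follow anomalous calls from `entry`, prune weakly
    correlated calls, and rank the chain endpoints as candidate root causes.

    calls:     dict mapping a service to the services it calls
    anomalous: set of (caller, callee) pairs already flagged as anomalous
    metrics:   dict mapping a service to its quality-metric time series
    business:  time series of the affected business metric
    """
    candidates, stack, visited = [], [entry], {entry}
    while stack:
        svc = stack.pop()
        extended = False
        for callee in calls.get(svc, []):
            if (svc, callee) not in anomalous:
                continue  # only anomalous calls can extend a propagation chain
            # Pruning (assumed rule): drop calls whose metric does not track the caller's.
            if abs(pearson(metrics[svc], metrics[callee])) < prune_below:
                continue
            extended = True
            if callee not in visited:
                visited.add(callee)
                stack.append(callee)
        if not extended:
            candidates.append(svc)  # the chain ends here: candidate root cause
    # Rank candidates by how strongly their metric correlates with the business metric.
    return sorted(candidates,
                  key=lambda s: abs(pearson(metrics[s], business)),
                  reverse=True)

# Made-up example data: service names and numbers are purely illustrative.
calls = {"gateway": ["order", "cart", "search"], "order": ["payment", "inventory"]}
anomalous = {("gateway", "order"), ("gateway", "cart"),
             ("order", "payment"), ("order", "inventory")}
metrics = {
    "gateway":   [1, 1, 2, 5, 6],
    "order":     [1, 1, 2, 5, 7],
    "cart":      [1, 2, 2, 4, 5],
    "payment":   [1, 1, 3, 6, 8],
    "inventory": [3, 1, 2, 2, 1],
    "search":    [1, 2, 1, 2, 1],
}
business = [9, 9, 8, 4, 2]  # e.g. successfully placed orders dropping

ranked = localize(calls, anomalous, metrics, business, entry="gateway")
print(ranked)
```

In this toy data the call to `inventory` is flagged as anomalous but is pruned because its metric barely tracks that of `order`, so the traversal never expands it.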
Pages: 338-347
Page count: 10