MicroHECL: High-Efficient Root Cause Localization in Large-Scale Microservice Systems

Cited by: 59
Authors
Liu, Dewei [1]
He, Chuan [1]
Peng, Xin [1]
Lin, Fan [2]
Zhang, Chenxi [1]
Gong, Shengfang [2]
Li, Ziang [2]
Ou, Jiayu [2]
Wu, Zheshun [2]
Affiliations
[1] Fudan University, Shanghai, People's Republic of China
[2] Alibaba Group, Hangzhou, People's Republic of China
Keywords
microservice; availability; root cause localization; anomaly detection; service call graph; GRAPH;
DOI
10.1109/ICSE-SEIP52600.2021.00043
Chinese Library Classification (CLC) number
TP31 [Computer Software]
Discipline classification code
081202; 0835
Abstract
Availability issues of industrial microservice systems (e.g., a drop in successfully placed orders or processed transactions) directly affect the running of the business. These issues are usually caused by various types of service anomalies that propagate along service dependencies. Accurate and highly efficient root cause localization is thus a critical challenge for large-scale industrial microservice systems. Existing approaches use analysis techniques based on the service dependency graph to automatically locate root causes. However, these approaches are limited by their inaccurate detection of service anomalies and inefficient traversal of the service dependency graph. In this paper, we propose a highly efficient root cause localization approach for availability issues of microservice systems, called MicroHECL. Based on a dynamically constructed service call graph, MicroHECL analyzes possible anomaly propagation chains and ranks candidate root causes based on correlation analysis. We combine machine learning and statistical methods and design customized models for the detection of different types of service anomalies (i.e., performance, reliability, traffic). To improve efficiency, we adopt a pruning strategy to eliminate irrelevant service calls in anomaly propagation chain analysis. Experimental studies show that MicroHECL significantly outperforms two state-of-the-art baseline approaches in terms of both accuracy and efficiency. MicroHECL has been used in Alibaba and achieves a top-3 hit ratio of 68%, with root cause localization time reduced from 30 minutes to 5 minutes.
Pages: 338-347
Page count: 10
Related papers
50 in total
  • [1] TraceDiag: Adaptive, Interpretable, and Efficient Root Cause Analysis on Large-Scale Microservice Systems
    Ding, Ruomeng
    Zhang, Chaoyun
    Wang, Lu
    Xu, Yong
    Ma, Minghua
    Wu, Xiaomin
    Zhang, Meng
    Chen, Qingjun
    Gao, Xin
    Gao, Xuedong
    Fan, Hao
    Rajmohan, Saravan
    Lin, Qingwei
    Zhang, Dongmei
    [J]. PROCEEDINGS OF THE 31ST ACM JOINT MEETING EUROPEAN SOFTWARE ENGINEERING CONFERENCE AND SYMPOSIUM ON THE FOUNDATIONS OF SOFTWARE ENGINEERING, ESEC/FSE 2023, 2023, : 1762 - 1773
  • [2] ServiceRank: Root Cause Identification of Anomaly in Large-Scale Microservice Architectures
    Ma, Meng
    Lin, Weilan
    Pan, Disheng
    Wang, Ping
    [J]. IEEE TRANSACTIONS ON DEPENDABLE AND SECURE COMPUTING, 2022, 19 (05) : 3087 - 3100
  • [3] Self-Adaptive Root Cause Diagnosis for Large-Scale Microservice Architecture
    Ma, Meng
    Lin, Weilan
    Pan, Disheng
    Wang, Ping
    [J]. IEEE TRANSACTIONS ON SERVICES COMPUTING, 2022, 15 (03) : 1399 - 1410
  • [4] Efficient and Robust Trace Anomaly Detection for Large-Scale Microservice Systems
    Zhang, Shenglin
    Pan, Zhongjie
    Liu, Heng
    Jin, Pengxiang
    Sun, Yongqian
    Ouyang, Qianyu
    Wang, Jiaju
    Jia, Xueying
    Zhang, Yuzhi
    Yang, Hui
    Zou, Yongqiang
    Pei, Dan
    [J]. 2023 IEEE 34TH INTERNATIONAL SYMPOSIUM ON SOFTWARE RELIABILITY ENGINEERING, ISSRE, 2023, : 69 - 79
  • [5] Design of large-scale, high-efficient, vertical wind turbine
    Lee, S.
    Song, W. -S.
    Kim, H. -R.
    Park, J. -G.
    [J]. FEDSM 2007: PROCEEDINGS OF THE 5TH JOINT ASME/JSME FLUIDS ENGINEERING SUMMER CONFERENCE, VOL 2, PTS A AND B, 2007, : 1095 - 1102
  • [6] Practical Root Cause Localization for Microservice Systems via Trace Analysis
    Li, Zeyan
    Chen, Junjie
    Jiao, Rui
    Zhao, Nengwen
    Wang, Zhijun
    Zhang, Shuwei
    Wu, Yanjun
    Jiang, Long
    Yan, Leiqin
    Wang, Zikai
    Chen, Zhekang
    Zhang, Wenchi
    Nie, Xiaohui
    Sui, Kaixin
    Pei, Dan
    [J]. 2021 IEEE/ACM 29TH INTERNATIONAL SYMPOSIUM ON QUALITY OF SERVICE (IWQOS), 2021,
  • [7] MicroIRC: Instance-level Root Cause Localization for Microservice Systems
    Zhu, Yuhan
    Wang, Jian
    Li, Bing
    Zhao, Yuqi
    Zhang, Zekun
    Xiong, Yiming
    Chen, Shiping
    [J]. JOURNAL OF SYSTEMS AND SOFTWARE, 2024, 216
  • [8] Graph-Based Root Cause Localization in Microservice Systems with Protection Mechanisms
    Tian, Wei
    Zhang, Haitao
    Yang, Neng
    Zhang, Yepeng
    [J]. INTERNATIONAL JOURNAL OF SOFTWARE ENGINEERING AND KNOWLEDGE ENGINEERING, 2023, 33 (08) : 1211 - 1238
  • [9] TraceModel: An Automatic Anomaly Detection and Root Cause Localization Framework for Microservice Systems
    Cai, Yang
    Han, Biao
    Su, Jinshu
    Wang, Xiaoyan
    [J]. 2021 17TH INTERNATIONAL CONFERENCE ON MOBILITY, SENSING AND NETWORKING (MSN 2021), 2021, : 512 - 519
  • [10] ModelCoder: A Fault Model based Automatic Root Cause Localization Framework for Microservice Systems
    Cai, Yang
    Han, Biao
    Li, Jie
    Zhao, Na
    Su, Jinshu
    [J]. 2021 IEEE/ACM 29TH INTERNATIONAL SYMPOSIUM ON QUALITY OF SERVICE (IWQOS), 2021,