Sleuth: A Trace-Based Root Cause Analysis System for Large-Scale Microservices with Graph Neural Networks

Cited by: 0
Authors
Gan, Yu [1]
Liu, Guiyang [1]
Zhang, Xin [1]
Zhou, Qi [1]
Wu, Jiesheng [1]
Jiang, Jiangwei [1]
Affiliations
[1] Alibaba Group, Hangzhou, Zhejiang, People's Republic of China
Keywords
DISTANCE
DOI
10.1145/3623278.3624758
Chinese Library Classification number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Cloud microservices are being scaled up due to the rising demand for new features and the convenience of cloud-native technologies. However, the growing scale of microservices complicates the remote procedure call (RPC) dependency graph, exacerbates the tail-of-scale effect, and makes many of the empirical rules for detecting the root cause of end-to-end performance issues unreliable. Additionally, existing open-source microservice benchmarks are too small to evaluate performance debugging algorithms at production scale, with hundreds or even thousands of services and RPCs. To address these challenges, we present Sleuth, a trace-based root cause analysis (RCA) system for large-scale microservices using unsupervised graph learning. Sleuth uses a graph neural network to capture the causal impact of each span in a trace, and clusters traces using a trace distance metric to reduce the number of traces required for root cause localization. A pre-trained Sleuth model can be transferred to different microservice applications without any retraining or with few-shot fine-tuning. To quantitatively evaluate the performance and scalability of Sleuth, we propose a method to generate microservice benchmarks comparable to production scale. Experiments on existing benchmark suites and synthetic large-scale microservices indicate that Sleuth significantly outperforms prior work in detection accuracy, performance, and adaptability in a large-scale deployment.
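The abstract does not define the trace distance metric or the clustering procedure, so the following is only a minimal sketch of the general idea of clustering traces by distance so that fewer traces need to be analyzed. It assumes, purely for illustration, that a trace is reduced to its set of caller-callee RPC edges, that distance is Jaccard distance over those sets, and that clustering is a simple greedy threshold scheme. The names `jaccard_distance`, `cluster_traces`, and the example services are hypothetical and are not taken from Sleuth.

```python
from typing import FrozenSet, List, Tuple

Edge = Tuple[str, str]    # (caller service, callee service); illustrative encoding
Trace = FrozenSet[Edge]   # a trace reduced to its set of RPC edges (an assumption)


def jaccard_distance(a: Trace, b: Trace) -> float:
    """Return 1 - |a & b| / |a | b|, or 0.0 when both traces are empty."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def cluster_traces(traces: List[Trace], threshold: float) -> List[List[int]]:
    """Greedy clustering: put each trace in the first cluster whose
    representative (its first member) is within `threshold`; otherwise
    start a new cluster. Returns lists of trace indices."""
    clusters: List[List[int]] = []
    for i, trace in enumerate(traces):
        for members in clusters:
            if jaccard_distance(traces[members[0]], trace) <= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters


# Hypothetical traces: two similar checkout-style traces and one search trace.
traces = [
    frozenset({("frontend", "cart"), ("cart", "db")}),
    frozenset({("frontend", "cart"), ("cart", "db"), ("cart", "cache")}),
    frozenset({("frontend", "search"), ("search", "index")}),
]
clusters = cluster_traces(traces, threshold=0.5)
# Only one representative per cluster would be passed on for further analysis.
representatives = [members[0] for members in clusters]
```

In this sketch the first two traces share two of three edges and fall into one cluster, so only two of the three traces would need to be examined. A real system would likely use a richer distance that also accounts for span latency and graph structure.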
Pages: 324-337
Page count: 14
Related papers
50 in total
  • [1] A Survey of Large-Scale Graph Neural Networks
    Xiao, Guo-Qing
    Li, Xue-Qi
    Chen, Yue-Dan
    Tang, Zhuo
    Jiang, Wen-Jun
    Li, Ken-Li
    [J]. Jisuanji Xuebao/Chinese Journal of Computers, 2024, 47(01): 148-171
  • [2] Blocking-based Neighbor Sampling for Large-scale Graph Neural Networks
    Yao, Kai-Lang
    Li, Wu-Jun
    [J]. PROCEEDINGS OF THE THIRTIETH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, IJCAI 2021, 2021: 3307-3313
  • [3] Fault Root Cause Tracing Method of Large-Scale Complicated Equipment Based on Fault Graph
    Huang, Xinlin
    Gao, Jianmin
    Gao, Zhiyong
    [J]. 2011 INTERNATIONAL CONFERENCE ON QUALITY, RELIABILITY, RISK, MAINTENANCE, AND SAFETY ENGINEERING (ICQR2MSE), 2011: 237-241
  • [4] Large-Scale Graph Neural Networks: The Past and New Frontiers
    Xue, Rui
    Han, Haoyu
    Zhao, Tong
    Shah, Neil
    Tang, Jiliang
    Liu, Xiaorui
    [J]. PROCEEDINGS OF THE 29TH ACM SIGKDD CONFERENCE ON KNOWLEDGE DISCOVERY AND DATA MINING, KDD 2023, 2023: 5835-5836
  • [5] Titan: Security Analysis of Large-Scale Hardware Obfuscation Using Graph Neural Networks
    Mankali, Likhitha
    Alrahis, Lilas
    Patnaik, Satwik
    Knechtel, Johann
    Sinanoglu, Ozgur
    [J]. IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY, 2023, 18: 304-318
  • [6] Large-Scale 802.11 Wireless Networks Data Analysis Based on Graph Clustering
    Capdehourat, German
    Bermolen, Paola
    Fiori, Marcelo
    Frevenza, Nicolas
    Larroca, Federico
    Morales, Gaston
    Rattaro, Claudina
    Zunino, Gianina
    [J]. WIRELESS PERSONAL COMMUNICATIONS, 2021, 120(02): 1791-1819
  • [7] Large-Scale 802.11 Wireless Networks Data Analysis Based on Graph Clustering
    Germán Capdehourat
    Paola Bermolen
    Marcelo Fiori
    Nicolás Frevenza
    Federico Larroca
    Gastón Morales
    Claudina Rattaro
    Gianina Zunino
    [J]. Wireless Personal Communications, 2021, 120: 1791-1819
  • [8] Graph Neural Networks for Friend Ranking in Large-scale Social Platforms
    Sankar, Aravind
    Liu, Yozen
    Yu, Jun
    Shah, Neil
    [J]. PROCEEDINGS OF THE WORLD WIDE WEB CONFERENCE 2021 (WWW 2021), 2021: 2535-2546
  • [9] Heterogeneous Graph Neural Networks for Large-Scale Bid Keyword Matching
    Liu, Zongtao
    Ma, Bin
    Liu, Quan
    Xu, Jian
    Zheng, Bo
    [J]. PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON INFORMATION & KNOWLEDGE MANAGEMENT, CIKM 2021, 2021: 3976-3985
  • [10] Near-Realtime Server Reboot Monitoring and Root Cause Analysis in a Large-Scale System
    Lin, Fred
    Bolla, Bhargav
    Pinkham, Eric
    Kodner, Neil
    Moore, Daniel
    Desai, Amol
    Sankar, Sriram
    [J]. 51ST ANNUAL IEEE/IFIP INTERNATIONAL CONFERENCE ON DEPENDABLE SYSTEMS AND NETWORKS - SUPPLEMENTAL VOL (DSN 2021), 2021: 37-40