Near-Realtime Server Reboot Monitoring and Root Cause Analysis in a Large-Scale System

Authors
Lin, Fred [1]
Bolla, Bhargav [1]
Pinkham, Eric [1]
Kodner, Neil [1]
Moore, Daniel [1]
Desai, Amol [1]
Sankar, Sriram [1]
Affiliations
[1] Facebook Inc., Menlo Park, CA 94025, USA
DOI
10.1109/DSN-S52858.2021.00027
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Large-scale Internet services run on a fleet of distributed servers, and the continuous availability of the hardware is key to the robustness of the services. Unplanned reboots disrupt the services running on the hardware and lower fleet availability. Server reboots are also important signals that could indicate underlying issues such as memory leaks in the services, catastrophic hardware failures, and network or power disruptions at the datacenters. In this paper, we present an at-scale, near-realtime reboot monitoring framework built on multiple state-of-the-art data infrastructures, together with machine learning-based anomaly detection and automated root cause analysis across hundreds of server attribute combinations. We observed that 1% of the reboots in our hardware fleet were associated with kernel panics and out-of-memory events, and that these reboots exhibit strong locality both temporally and across services.
Pages: 37-40
Page count: 4