Near-Realtime Server Reboot Monitoring and Root Cause Analysis in a Large-Scale System

Cited by: 1
Authors
Lin, Fred [1]
Bolla, Bhargav [1]
Pinkham, Eric [1]
Kodner, Neil [1]
Moore, Daniel [1]
Desai, Amol [1]
Sankar, Sriram [1]
Affiliation
[1] Facebook Inc, Menlo Pk, CA 94025 USA
DOI
10.1109/DSN-S52858.2021.00027
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology];
Discipline classification code
0812;
Abstract
Large-scale Internet services run on a fleet of distributed servers, and the continuous availability of the hardware is key to the robustness of the services. Unplanned reboots disrupt the services running on the hardware and lower the fleet availability. Server reboots are also important signals that could indicate underlying issues such as memory leaks from the services, catastrophic hardware failures, and network or power disruptions at the datacenters. In this paper, we present an at-scale, near-realtime reboot monitoring framework built with multiple state-of-the-art data infrastructures, as well as machine learning-based anomaly detection and automated root cause analysis across hundreds of server attribute combinations. We observed that 1% of the reboots in our hardware fleet were associated with kernel panics and out-of-memory events, and these reboots exhibit strong locality temporally and across services.
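The abstract does not say how the root cause analysis across server attribute combinations is computed. As a rough illustration of the general idea only, the sketch below ranks attribute combinations by how over-represented they are among rebooted servers relative to the whole fleet. The lift heuristic, the function name `rank_attribute_combinations`, the attribute names (`service`, `kernel`), and the sample data are all assumptions made for this example and are not taken from the paper.

```python
from collections import Counter
from itertools import combinations


def rank_attribute_combinations(reboots, fleet, max_size=2, min_reboots=2):
    """Rank attribute combinations by over-representation among reboots.

    reboots, fleet: lists of dicts mapping attribute name -> value
    (hypothetical schema). Returns (combination, lift, reboot_count)
    tuples, highest lift first, where lift is the combination's share
    of reboots divided by its share of the fleet.
    """
    def count(records):
        counts = Counter()
        for rec in records:
            items = sorted(rec.items())  # stable key order for combos
            for size in range(1, max_size + 1):
                for combo in combinations(items, size):
                    counts[combo] += 1
        return counts

    reboot_counts = count(reboots)
    fleet_counts = count(fleet)
    ranked = []
    for combo, n in reboot_counts.items():
        # Skip rare combinations and ones absent from the fleet inventory.
        if n < min_reboots or fleet_counts[combo] == 0:
            continue
        reboot_share = n / len(reboots)
        fleet_share = fleet_counts[combo] / len(fleet)
        ranked.append((combo, reboot_share / fleet_share, n))
    ranked.sort(key=lambda r: (-r[1], -r[2], r[0]))
    return ranked


# Made-up fleet inventory: 12 servers across two services and two kernels.
fleet = (
    [{"service": "web", "kernel": "5.2"}] * 4
    + [{"service": "web", "kernel": "5.6"}] * 2
    + [{"service": "cache", "kernel": "5.2"}] * 4
    + [{"service": "cache", "kernel": "5.6"}] * 2
)
# Made-up reboot events: half of them come from web servers on kernel 5.6.
reboots = (
    [{"service": "web", "kernel": "5.6"}] * 2
    + [{"service": "web", "kernel": "5.2"}]
    + [{"service": "cache", "kernel": "5.2"}]
)

ranked = rank_attribute_combinations(reboots, fleet)
for combo, lift, n in ranked:
    print(dict(combo), round(lift, 2), n)
```

In this toy data the pair (kernel 5.6, service web) ranks first because it accounts for half of the reboots but only a sixth of the fleet. A production system would also need statistical significance checks and time windows, which this sketch leaves out.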
Pages: 37-40
Page count: 4