Near-Realtime Server Reboot Monitoring and Root Cause Analysis in a Large-Scale System

Cited by: 1
Authors
Lin, Fred [1]
Bolla, Bhargav [1]
Pinkham, Eric [1]
Kodner, Neil [1]
Moore, Daniel [1]
Desai, Amol [1]
Sankar, Sriram [1]
Affiliation
[1] Facebook Inc., Menlo Park, CA 94025, USA
DOI
10.1109/DSN-S52858.2021.00027
Chinese Library Classification
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Large-scale Internet services run on a fleet of distributed servers, and the continuous availability of the hardware is key to the robustness of the services. Unplanned reboots disrupt the services running on the hardware and lower the fleet availability. Server reboots are also important signals that could indicate underlying issues such as memory leaks from the services, catastrophic hardware failures, and network or power disruptions at the datacenters. In this paper, we present an at-scale, near-realtime reboot monitoring framework built with multiple state-of-the-art data infrastructures, as well as machine learning-based anomaly detection and automated root cause analysis across hundreds of server attribute combinations. We observed that 1% of the reboots in our hardware fleet were associated with kernel panics and out-of-memory events, and these reboots exhibit strong locality temporally and across services.
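
To make the idea of root cause analysis across server attribute combinations concrete, here is a minimal sketch in Python. It is not the authors' method, which the abstract does not specify. It assumes each server is a dictionary with a hypothetical "id" field plus attribute fields (the names "service" and "rack" used in the usage note are invented for illustration), and it ranks attribute-value combinations by lift, meaning the reboot rate within the combination divided by the fleet-wide reboot rate.

```python
from collections import Counter
from itertools import combinations


def rank_attribute_combinations(servers, rebooted_ids, attributes, max_order=2):
    """Rank attribute-value combinations by reboot-rate lift over the fleet.

    servers: list of dicts, each with an "id" key and one key per attribute
             (hypothetical schema, assumed for this sketch).
    rebooted_ids: set of server ids that rebooted in the window of interest.
    attributes: attribute names to combine, e.g. ["service", "rack"].
    max_order: largest number of attributes combined at once.
    Returns a list of (combination, servers, reboots, lift), highest lift first.
    """
    if not servers:
        return []
    fleet_rate = len(rebooted_ids) / len(servers)
    results = []
    for order in range(1, max_order + 1):
        for attrs in combinations(attributes, order):
            total = Counter()
            rebooted = Counter()
            for server in servers:
                key = tuple(server[a] for a in attrs)
                total[key] += 1
                if server["id"] in rebooted_ids:
                    rebooted[key] += 1
            for key, count in total.items():
                rate = rebooted[key] / count
                # Lift > 1 means this slice reboots more often than the fleet.
                lift = rate / fleet_rate if fleet_rate else 0.0
                results.append((dict(zip(attrs, key)), count, rebooted[key], lift))
    # Highest lift first; ties broken by the number of reboots in the slice.
    results.sort(key=lambda r: (r[3], r[2]), reverse=True)
    return results
```

Called with a list of server records, the set of rebooted ids, and the attribute names, the function returns the slices of the fleet where reboots are most concentrated. A production system would also need a minimum-support threshold and a significance test, since small slices can show a high lift by chance.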
Pages: 37-40
Page count: 4