Predicting Remediations for Hardware Failures in Large-Scale Datacenters

Cited by: 4
Authors
Lin, Fred [1]
Davoli, Antonio [1]
Akbar, Imran [1]
Kalmanje, Sukumar [1]
Silva, Leandro [1]
Stamford, John [1]
Golany, Yanai [1]
Piazza, Jim [1]
Sankar, Sriram [1]
Affiliations
[1] Facebook Inc., Menlo Park, CA 94025, USA
DOI
10.1109/DSN-S50200.2020.00016
Chinese Library Classification (CLC)
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Large-scale service environments rely on autonomous systems to remediate hardware failures efficiently. In production, the autonomous system diagnoses hardware failures using rules that subject matter experts encode in the system. This process is increasingly complex given new types of failures and the growing complexity of hardware and software configurations. In this paper, we present a machine learning framework that predicts the required remediations for undiagnosed failures, based on similar repair tickets closed in the past. We explain in detail the methodology for setting up a machine learning model, deploying it in a production environment, and monitoring its performance with the necessary metrics. We also demonstrate the prediction performance on some of the repair actions.
Pages: 13-16
Page count: 4