Predicting Remediations for Hardware Failures in Large-Scale Datacenters

Cited by: 4
Authors
Lin, Fred [1]
Davoli, Antonio [1]
Akbar, Imran [1]
Kalmanje, Sukumar [1]
Silva, Leandro [1]
Stamford, John [1]
Golany, Yanai [1]
Piazza, Jim [1]
Sankar, Sriram [1]
Affiliations
[1] Facebook Inc, Menlo Pk, CA 94025 USA
DOI
10.1109/DSN-S50200.2020.00016
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Large-scale service environments rely on autonomous systems for remediating hardware failures efficiently. In production, the autonomous system diagnoses hardware failures based on the rules that the subject matter experts put in the system. This process is increasingly complex given new types of failures and the increasing complexity in the hardware and software configurations. In this paper, we present a machine learning framework that predicts the required remediations for undiagnosed failures, based on the similar repair tickets closed in the past. We explain the methodology in detail for setting up a machine learning model, deploying it in a production environment, and monitoring its performance with the necessary metrics. We also demonstrate the prediction performance on some of the repair actions.
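The abstract describes predicting a remediation for an undiagnosed failure from similar repair tickets closed in the past, but this record does not specify the model. As an illustration of that general idea only, here is a minimal sketch that swaps in a simple similarity-weighted nearest-neighbor vote. It is not the authors' framework. The ticket texts, the repair action names, and the function `predict_remediation` are all hypothetical.

```python
from collections import Counter

# Hypothetical closed repair tickets: (symptom text, remediation that closed the ticket).
# These examples are invented for illustration; they are not data from the paper.
PAST_TICKETS = [
    ("dimm ecc uncorrectable error bank 3", "replace_dimm"),
    ("dimm ecc correctable error threshold exceeded", "replace_dimm"),
    ("ssd smart wear level critical", "replace_ssd"),
    ("ssd io timeout smart reallocated sectors", "replace_ssd"),
    ("nic link flap firmware mismatch", "reflash_firmware"),
]


def tokens(text):
    """Split a symptom description into a set of lowercase words."""
    return set(text.lower().split())


def jaccard(a, b):
    """Jaccard similarity of two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def predict_remediation(symptoms, history=PAST_TICKETS, k=3):
    """Return the remediation favored by the k most similar past tickets.

    Each of the top-k tickets votes for its remediation, weighted by its
    similarity to the query. Returns None when no past ticket overlaps.
    """
    query = tokens(symptoms)
    scored = sorted(
        ((jaccard(query, tokens(text)), action) for text, action in history),
        key=lambda pair: pair[0],
        reverse=True,
    )
    votes = Counter()
    for score, action in scored[:k]:
        if score > 0:
            votes[action] += score
    if not votes:
        return None
    return votes.most_common(1)[0][0]
```

In this sketch, calling `predict_remediation("ecc error on dimm")` selects the action shared by the two overlapping memory tickets, while a description with no words in common with any past ticket yields `None`, leaving the failure for manual diagnosis. A production system would need richer features and the monitoring metrics the abstract mentions; none of that is modeled here.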
Pages: 13-16
Number of pages: 4