Predicting Remediations for Hardware Failures in Large-Scale Datacenters

Cited by: 4
Authors
Lin, Fred [1]
Davoli, Antonio [1]
Akbar, Imran [1]
Kalmanje, Sukumar [1]
Silva, Leandro [1]
Stamford, John [1]
Golany, Yanai [1]
Piazza, Jim [1]
Sankar, Sriram [1]
Affiliations
[1] Facebook Inc, Menlo Pk, CA 94025 USA
DOI
10.1109/DSN-S50200.2020.00016
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Large-scale service environments rely on autonomous systems for remediating hardware failures efficiently. In production, the autonomous system diagnoses hardware failures based on the rules that the subject matter experts put in the system. This process is increasingly complex given new types of failures and the increasing complexity in the hardware and software configurations. In this paper, we present a machine learning framework that predicts the required remediations for undiagnosed failures, based on the similar repair tickets closed in the past. We explain the methodology in detail for setting up a machine learning model, deploying it in a production environment, and monitoring its performance with the necessary metrics. We also demonstrate the prediction performance on some of the repair actions.
Pages: 13-16
Page count: 4