Predicting Remediations for Hardware Failures in Large-Scale Datacenters

Cited by: 4
Authors
Lin, Fred [1]
Davoli, Antonio [1]
Akbar, Imran [1]
Kalmanje, Sukumar [1]
Silva, Leandro [1]
Stamford, John [1]
Golany, Yanai [1]
Piazza, Jim [1]
Sankar, Sriram [1]
Affiliation
[1] Facebook Inc, Menlo Pk, CA 94025 USA
Keywords
DOI
10.1109/DSN-S50200.2020.00016
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology];
Discipline classification code
0812;
Abstract
Large-scale service environments rely on autonomous systems for remediating hardware failures efficiently. In production, the autonomous system diagnoses hardware failures based on the rules that the subject matter experts put in the system. This process is increasingly complex given new types of failures and the increasing complexity in the hardware and software configurations. In this paper, we present a machine learning framework that predicts the required remediations for undiagnosed failures, based on the similar repair tickets closed in the past. We explain the methodology in detail for setting up a machine learning model, deploying it in a production environment, and monitoring its performance with the necessary metrics. We also demonstrate the prediction performance on some of the repair actions.
Pages: 13 - 16
Page count: 4
Related papers
50 in total
  • [1] Soft Failures in Large Datacenters
    Sankar, Sriram
    Gurumurthi, Sudhanva
    IEEE COMPUTER ARCHITECTURE LETTERS, 2014, 13 (02) : 105 - 108
  • [2] A methodology for large-scale hardware verification
    Aagaard, MD
    Jones, RB
    Melham, TF
    O'Leary, JW
    Seger, CJH
    FORMAL METHODS IN COMPUTER-AIDED DESIGN, PROCEEDINGS, 2000, 1954 : 263 - 282
  • [3] Efficient and Robust KPI Outlier Detection for Large-Scale Datacenters
    Sun, Yongqian
    Cheng, Daguo
    Yang, Tiankai
    Ji, Yuhe
    Zhang, Shenglin
    Zhu, Man
    Xiong, Xiao
    Fan, Qiliang
    Liang, Minghan
    Pei, Dan
    Ma, Tianchi
    Chen, Yu
    IEEE TRANSACTIONS ON COMPUTERS, 2023, 72 (10) : 2858 - 2871
  • [4] Understanding the Context of Large-Scale IT Project Failures
    Rich, Eliot
    Nelson, Mark R.
    INTERNATIONAL JOURNAL OF INFORMATION TECHNOLOGIES AND SYSTEMS APPROACH, 2012, 5 (02) : 1 - 24
  • [5] A Large-Scale Study of Failures on Petascale Supercomputers
    Rui-Tao Liu
    Zuo-Ning Chen
    Journal of Computer Science and Technology, 2018, 33 : 24 - 41
  • [6] A Large-Scale Study of Failures on Petascale Supercomputers
    Liu, Rui-Tao
    Chen, Zuo-Ning
    JOURNAL OF COMPUTER SCIENCE AND TECHNOLOGY, 2018, 33 (01) : 24 - 41
  • [7] Improving BGP convergence delay for large-scale failures
    Sahoo, Amit
    Kant, Krishna
    Mohapatra, Prasant
    DSN 2006 INTERNATIONAL CONFERENCE ON DEPENDABLE SYSTEMS AND NETWORKS, PROCEEDINGS, 2006, : 323 - 332
  • [8] Performance implications of failures in large-scale cluster scheduling
    Zhang, YY
    Squillante, MS
    Sivasubramaniam, A
    Sahoo, RK
    JOB SCHEDULING STRATEGIES FOR PARALLEL PROCESSING, 2005, 3277 : 233 - 252
  • [9] Large-scale End-of-Life Prediction of Hard Disks in Distributed Datacenters
    Mohapatra, Rohan
    Coursey, Austin
    Sengupta, Saptarshi
    2023 IEEE INTERNATIONAL CONFERENCE ON SMART COMPUTING, SMARTCOMP, 2023, : 261 - 266
  • [10] History-Based Harvesting of Spare Cycles and Storage in Large-Scale Datacenters
    Zhan, Yunqi
    Prekas, George
    Fumarola, Giovanni Matteo
    Fontoura, Marcus
    Goiri, Inigo
    Bianchini, Ricardo
    PROCEEDINGS OF OSDI'16: 12TH USENIX SYMPOSIUM ON OPERATING SYSTEMS DESIGN AND IMPLEMENTATION, 2016, : 755 - 770