Predicting Remediations for Hardware Failures in Large-Scale Datacenters

Cited by: 4

Authors:

Lin, Fred [1]

Davoli, Antonio [1]

Akbar, Imran [1]

Kalmanje, Sukumar [1]

Silva, Leandro [1]

Stamford, John [1]

Golany, Yanai [1]

Piazza, Jim [1]

Sankar, Sriram [1]

Affiliations:

[1] Facebook Inc, Menlo Pk, CA 94025 USA

Source:

2020 50TH ANNUAL IEEE-IFIP INTERNATIONAL CONFERENCE ON DEPENDABLE SYSTEMS AND NETWORKS-SUPPLEMENTAL VOLUME (DSN-S) | 2020

DOI:

10.1109/DSN-S50200.2020.00016

Chinese Library Classification (CLC) number:

TP3 [Computing technology, computer technology]

Discipline classification code:

0812

Abstract:

Large-scale service environments rely on autonomous systems for remediating hardware failures efficiently. In production, the autonomous system diagnoses hardware failures based on the rules that the subject matter experts put in the system. This process is increasingly complex given new types of failures and the increasing complexity in the hardware and software configurations. In this paper, we present a machine learning framework that predicts the required remediations for undiagnosed failures, based on the similar repair tickets closed in the past. We explain the methodology in detail for setting up a machine learning model, deploying it in a production environment, and monitoring its performance with the necessary metrics. We also demonstrate the prediction performance on some of the repair actions.

Pages: 13 - 16

Page count: 4

