Predicting Remediations for Hardware Failures in Large-Scale Datacenters

Cited by: 4

Authors:

Lin, Fred [1]
Davoli, Antonio [1]
Akbar, Imran [1]
Kalmanje, Sukumar [1]
Silva, Leandro [1]
Stamford, John [1]
Golany, Yanai [1]
Piazza, Jim [1]
Sankar, Sriram [1]

Affiliations:

[1] Facebook Inc, Menlo Park, CA 94025 USA

Source:

2020 50TH ANNUAL IEEE-IFIP INTERNATIONAL CONFERENCE ON DEPENDABLE SYSTEMS AND NETWORKS-SUPPLEMENTAL VOLUME (DSN-S) | 2020

DOI:

10.1109/DSN-S50200.2020.00016

Chinese Library Classification (CLC) number:

TP3 [Computing technology, computer technology]

Discipline classification code:

0812

Abstract:

Large-scale service environments rely on autonomous systems for remediating hardware failures efficiently. In production, the autonomous system diagnoses hardware failures based on the rules that the subject matter experts put in the system. This process is increasingly complex given new types of failures and the increasing complexity in the hardware and software configurations. In this paper, we present a machine learning framework that predicts the required remediations for undiagnosed failures, based on the similar repair tickets closed in the past. We explain the methodology in detail for setting up a machine learning model, deploying it in a production environment, and monitoring its performance with the necessary metrics. We also demonstrate the prediction performance on some of the repair actions.

Pages: 13-16

Page count: 4
