Fault prediction under the microscope: A closer look into HPC systems

Cited by: 0

Authors:

Gainaru, Ana [1]

Cappello, Franck [2]

Snir, Marc [3]

Kramer, William [4]

Affiliations:

[1] UIUC, Dept Comp Sci, Urbana, IL 61801 USA

[2] INRIA, Paris, France

[3] MCS, ANL, Lemont, IL USA

[4] UIUC, NCSA, Urbana, IL USA

Source:

2012 INTERNATIONAL CONFERENCE FOR HIGH PERFORMANCE COMPUTING, NETWORKING, STORAGE AND ANALYSIS (SC) | 2012

Keywords:

fault tolerance; large-scale HPC systems; signal analysis; fault detection

DOI:

Not available

Chinese Library Classification (CLC) number:

TP301 [Theory, methods]

Discipline classification code:

081202

Abstract:

A large percentage of computing capacity in today's large high-performance computing systems is wasted because of failures. Consequently, current research focuses on fault tolerance strategies that aim to minimize the effects of faults on applications. By far the most popular technique is the checkpoint-restart strategy. A complement to this classical approach is failure avoidance, by which the occurrence of a fault is predicted and preventive measures are taken. This requires a reliable prediction system to anticipate failures and their locations. Thus far, research in this field has used ideal predictors that were not implemented in real HPC systems. In this paper, we merge signal analysis concepts with data mining techniques to extend the ELSA (Event Log Signal Analyzer) toolkit and offer an adaptive and more efficient prediction module. Our goal is to provide models that characterize the normal behavior of a system and the way faults affect it. Being able to detect deviations from normality quickly is the foundation of accurate fault prediction. However, this is challenging because component failure dynamics are heterogeneous in space and time. To this end, a large part of the paper is devoted to a detailed analysis of the prediction method, applying it to two large-scale systems and investigating the characteristics and bottlenecks of each step of the prediction process. Furthermore, we analyze the impact of the prediction's precision and recall on current checkpointing strategies and highlight future improvements and directions for research in this field.


Pages: 11
