Fault prediction under the microscope: A closer look into HPC systems

Cited by: 0
Authors
Gainaru, Ana [1]
Cappello, Franck [2]
Snir, Marc [3]
Kramer, William [4]
Affiliations
[1] UIUC, Dept Comp Sci, Urbana, IL 61801 USA
[2] INRIA, Paris, France
[3] MCS, ANL, Lemont, IL USA
[4] UIUC, NCSA, Urbana, IL USA
Keywords
fault tolerance; large-scale HPC systems; signal analysis; fault detection
DOI
Not available
Chinese Library Classification number
TP301 [Theory, methods]
Discipline classification code
081202
Abstract
A large percentage of computing capacity in today's large high-performance computing systems is wasted because of failures. Consequently, current research is focusing on fault tolerance strategies that aim to minimize the effects of faults on applications. By far the most popular technique is the checkpoint-restart strategy. A complement to this classical approach is failure avoidance, in which the occurrence of a fault is predicted and preventive measures are taken. This requires a reliable prediction system that anticipates failures and their locations. Thus far, research in this field has used ideal predictors that were not implemented in real HPC systems. In this paper, we merge signal analysis concepts with data mining techniques to extend the ELSA (Event Log Signal Analyzer) toolkit and offer an adaptive and more efficient prediction module. Our goal is to provide models that characterize the normal behavior of a system and the way faults affect it. Being able to detect deviations from normality quickly is the foundation of accurate fault prediction. However, this is challenging because component failure dynamics are heterogeneous in space and time. To this end, a large part of the paper is devoted to a detailed analysis of the prediction method, applying it to two large-scale systems and investigating the characteristics and bottlenecks of each step of the prediction process. Furthermore, we analyze the impact of the prediction's precision and recall on current checkpointing strategies and highlight future improvements and directions for research in this field.
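The abstract evaluates the predictor in terms of precision and recall. As a reminder of what those two figures measure, here is a minimal sketch using the standard definitions. The function name and the counts are hypothetical and are not taken from the paper or from the ELSA toolkit.

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Return (precision, recall) for a failure predictor.

    precision: fraction of issued predictions that matched a real failure.
    recall:    fraction of real failures that were predicted in advance.
    """
    predicted = true_positives + false_positives   # all predictions issued
    actual = true_positives + false_negatives      # all failures that occurred
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    return precision, recall


# Hypothetical counts, for illustration only (not results from the paper).
p, r = precision_recall(true_positives=45, false_positives=5, false_negatives=55)
print(f"precision={p:.2f} recall={r:.2f}")
```

Roughly speaking, a predictor with high precision but low recall rarely raises false alarms yet misses many failures, so failures that go unpredicted still have to be handled by checkpoint-restart.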
Pages: 11