Understanding Soft Error Sensitivity of Deep Learning Models and Frameworks through Checkpoint Alteration

Cited by: 4
Authors
Rojas, Elvis [1, 2]
Perez, Diego [1]
Calhoun, Jon C. [3]
Gomez, Leonardo Bautista [4]
Jones, Terry [5]
Meneses, Esteban [1, 6]
Affiliations
[1] Costa Rica Inst Technol, Cartago, Costa Rica
[2] Natl Univ Costa Rica, Heredia, Costa Rica
[3] Clemson Univ, Clemson, SC 29631 USA
[4] Barcelona Supercomp Ctr, Barcelona, Spain
[5] Oak Ridge Natl Lab, Oak Ridge, TN USA
[6] Costa Rica Natl High Technol Ctr, San Jose, Costa Rica
Source
2021 IEEE INTERNATIONAL CONFERENCE ON CLUSTER COMPUTING (CLUSTER 2021) | 2021
Keywords
deep learning; resilience; checkpoint; neural networks; high-performance computing; HDF5; fault injection; FAULT-TOLERANCE; NEURAL-NETWORKS
DOI
10.1109/Cluster48925.2021.00045
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
The convergence of artificial intelligence, high-performance computing (HPC), and data science brings unique opportunities for marked advances and discoveries that leverage synergies across scientific domains. Recently, deep learning (DL) models have been successfully applied to a wide spectrum of fields, from social network analysis to climate modeling. Such advances greatly benefit from already available HPC infrastructure, mainly GPU-enabled supercomputers. However, those powerful computing systems are exposed to failures, particularly silent data corruption (SDC), in which bit-flips occur without the program crashing. Consequently, exploring the impact of SDCs on DL models is vital for maintaining progress in many scientific domains. This paper uses a distinctive methodology to inject faults into the training phases of DL models. We use checkpoint file alteration to study the effect of bit-flips at different locations in a model and at different moments of the training. Our strategy is general enough to allow the analysis of any combination of DL model and framework, so long as they produce a Hierarchical Data Format 5 (HDF5) checkpoint file. The experimental results confirm that popular DL models are often able to absorb dozens of bit-flips with minimal impact on accuracy convergence.
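To make the notion of a bit-flip concrete, the following is a minimal sketch of inverting one bit in the IEEE 754 single-precision encoding of a weight value. It is not the authors' injector: the function name `flip_bit_float32` and the example weight are hypothetical, and the sketch assumes the stored values are 32-bit floats. Applying the same operation to values inside a checkpoint file would additionally require an HDF5 library to read and write the datasets.

```python
import struct


def flip_bit_float32(value: float, bit: int) -> float:
    """Return `value` (stored as float32) with one bit of its encoding inverted.

    Hypothetical helper for illustration. Bit 0 is the least significant
    mantissa bit; bit 31 is the sign bit.
    """
    if not 0 <= bit <= 31:
        raise ValueError("bit must be in the range 0..31")
    # Reinterpret the float32 bytes as a 32-bit unsigned integer.
    (as_int,) = struct.unpack("<I", struct.pack("<f", value))
    # XOR with a one-bit mask inverts exactly that bit.
    corrupted = as_int ^ (1 << bit)
    # Reinterpret the corrupted integer as a float32 again.
    (result,) = struct.unpack("<f", struct.pack("<I", corrupted))
    return result


weight = 0.5  # assumed example value, not taken from the paper
print(flip_bit_float32(weight, 31))  # sign bit flipped: -0.5
print(flip_bit_float32(weight, 30))  # most significant exponent bit flipped
print(flip_bit_float32(weight, 0))   # least significant mantissa bit flipped
```

The position of the flipped bit matters a great deal in this sketch: a flip in the low mantissa bits changes the value only slightly, while a flip in the high exponent bits changes its magnitude by many orders.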
Pages: 492 - 503
Number of pages: 12
Related papers
50 in total
  • [31] Correction to: Deep learning frameworks to learn prediction and simulation focused control system models
    Turcan Tuna
    Aykut Beke
    Tufan Kumbasar
    Applied Intelligence, 2022, 52 : 680 - 680
  • [32] Looking Through the Deep Glasses: How Large Language Models Enhance Explainability of Deep Learning Models
    Spitzer, Philipp
    Celis, Sebastian
    Martin, Dominik
    Kuehl, Niklas
    Satzger, Gerhard
    PROCEEDINGS OF THE 2024 CONFERENCE ON MENSCH UND COMPUTER, MUC 2024, 2024, : 566 - 570
  • [33] Genome analysis through image processing with deep learning models
    Zhang, Yao-zhong
    Imoto, Seiya
    JOURNAL OF HUMAN GENETICS, 2024, 69 (10) : 519 - 525
  • [34] Analyzing Neuroimaging Data Through Recurrent Deep Learning Models
    Thomas, Armin W.
    Heekeren, Hauke R.
    Mueller, Klaus-Robert
    Samek, Wojciech
    FRONTIERS IN NEUROSCIENCE, 2019, 13
  • [35] Improving quantitative flowering models through a better understanding of the phases of photoperiod sensitivity
    Adams, SR
    Pearson, S
    Hadley, P
    JOURNAL OF EXPERIMENTAL BOTANY, 2001, 52 (357) : 655 - 662
  • [36] Understanding landscapes through knowledge management frameworks, spatial models, decision support tools and visualisation
    Department of Primary Industries, Parkville Centre, VIC, Australia
    Lect. Notes Geoinformation Cartogr., 2008, 9783540691679 : 3 - 16
  • [37] Feature Selection Methods for Deep Learning Models of Soft Sensors in Oil Refining
    I. S. Lazukhin
    M. I. Petrovskiy
    I. V. Mashechkin
    Moscow University Physics Bulletin, 2024, 79 (Suppl 2) : S872 - S889
  • [38] LLC Block Reuse Predictor Design using Deep Learning to Mitigate Soft Error in Multicore
    Choudhury, Avishek
    Mondal, Brototi
    Paul, Kolin
    Sikdar, Biplab K.
    PROCEEDINGS OF THE 37TH INTERNATIONAL CONFERENCE ON VLSI DESIGN, VLSID 2024 AND 23RD INTERNATIONAL CONFERENCE ON EMBEDDED SYSTEMS, ES 2024, 2024, : 690 - 695
  • [39] Understanding Naturalistic Facial Expressions with Deep Learning and Multimodal Large Language Models
    Bian, Yifan
    Kuester, Dennis
    Liu, Hui
    Krumhuber, Eva G.
    SENSORS, 2024, 24 (01)
  • [40] Understanding Privacy Risks in Typical Deep Learning Models for Medical Image Analysis
    Subbanna, Nagesh
    Tuladhar, Anup
    Wilms, Matthias
    Forkert, Nils D.
    MEDICAL IMAGING 2021: IMAGING INFORMATICS FOR HEALTHCARE, RESEARCH, AND APPLICATIONS, 2021, 11601