Understanding Soft Error Sensitivity of Deep Learning Models and Frameworks through Checkpoint Alteration

Cited by: 4

Authors:

Rojas, Elvis [1,2]

Perez, Diego [1]

Calhoun, Jon C. [3]

Gomez, Leonardo Bautista [4]

Jones, Terry [5]

Meneses, Esteban [1,6]

Affiliations:

[1] Costa Rica Inst Technol, Cartago, Costa Rica

[2] Natl Univ Costa Rica, Heredia, Costa Rica

[3] Clemson Univ, Clemson, SC 29631 USA

[4] Barcelona Supercomp Ctr, Barcelona, Spain

[5] Oak Ridge Natl Lab, Oak Ridge, TN USA

[6] Costa Rica Natl High Technol Ctr, San Jose, Costa Rica

Source:

2021 IEEE INTERNATIONAL CONFERENCE ON CLUSTER COMPUTING (CLUSTER 2021) | 2021

Keywords:

deep learning; resilience; checkpoint; neural networks; high-performance computing; HDF5; fault injection; FAULT-TOLERANCE; NEURAL-NETWORKS;

DOI:

10.1109/Cluster48925.2021.00045

Chinese Library Classification (CLC) number:

TP3 [Computing technology, computer technology];

Discipline classification code:

0812;

Abstract:

The convergence of artificial intelligence, high-performance computing (HPC), and data science brings unique opportunities to markedly advance discoveries that leverage synergies across scientific domains. Recently, deep learning (DL) models have been successfully applied to a wide spectrum of fields, from social network analysis to climate modeling. Such advances greatly benefit from already available HPC infrastructure, mainly GPU-enabled supercomputers. However, those powerful computing systems are exposed to failures, particularly silent data corruption (SDC), in which bit-flips occur without the program crashing. Consequently, exploring the impact of SDCs on DL models is vital for maintaining progress in many scientific domains. This paper uses a distinctive methodology to inject faults into the training phases of DL models. We use checkpoint file alteration to study the effect of bit-flips in different places of a model and at different moments of the training. Our strategy is general enough to allow the analysis of any combination of DL model and framework, so long as they produce a Hierarchical Data Format 5 (HDF5) checkpoint file. The experimental results confirm that popular DL models are often able to absorb dozens of bit-flips with a minimal impact on accuracy convergence.
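As a rough illustration of the kind of fault the abstract describes, the following minimal sketch flips single bits in the IEEE 754 float32 encoding of weight values. It is not the authors' tool: the function names `flip_bit_float32` and `inject_bit_flips`, the uniform random choice of weight and bit position, and the use of a plain Python list in place of an HDF5 checkpoint dataset are all assumptions made for illustration.

```python
import random
import struct
from typing import List


def flip_bit_float32(value: float, bit: int) -> float:
    """Flip one bit (0 = least significant, 31 = sign) of value's float32 encoding.

    Hypothetical helper for illustration only.
    """
    if not 0 <= bit < 32:
        raise ValueError("bit must be in the range [0, 31]")
    # Reinterpret the float32 bytes as an unsigned 32-bit integer.
    (as_int,) = struct.unpack("<I", struct.pack("<f", value))
    # XOR with a one-hot mask flips exactly the selected bit.
    (flipped,) = struct.unpack("<f", struct.pack("<I", as_int ^ (1 << bit)))
    return flipped


def inject_bit_flips(weights: List[float], n_flips: int, seed: int = 0) -> List[float]:
    """Return a copy of weights with n_flips random single-bit flips applied.

    Assumption: weight index and bit position are drawn uniformly at random.
    The same weight may be hit more than once.
    """
    rng = random.Random(seed)
    corrupted = list(weights)
    for _ in range(n_flips):
        idx = rng.randrange(len(corrupted))
        bit = rng.randrange(32)
        corrupted[idx] = flip_bit_float32(corrupted[idx], bit)
    return corrupted
```

Which bit is hit matters a great deal: flipping the lowest mantissa bit of 1.0 changes it by about 1.2e-7, flipping the sign bit yields -1.0, and flipping the top exponent bit yields infinity. Applying the same idea to a real checkpoint would additionally require reading and rewriting the HDF5 datasets with an HDF5 library, which this sketch does not show.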

Pages: 492-503

Page count: 12
