Understanding Soft Error Sensitivity of Deep Learning Models and Frameworks through Checkpoint Alteration

Cited by: 4

Authors:

Rojas, Elvis [1,2]

Perez, Diego [1]

Calhoun, Jon C. [3]

Gomez, Leonardo Bautista [4]

Jones, Terry [5]

Meneses, Esteban [1,6]

Affiliations:

[1] Costa Rica Inst Technol, Cartago, Costa Rica

[2] Natl Univ Costa Rica, Heredia, Costa Rica

[3] Clemson Univ, Clemson, SC 29631 USA

[4] Barcelona Supercomp Ctr, Barcelona, Spain

[5] Oak Ridge Natl Lab, Oak Ridge, TN USA

[6] Costa Rica Natl High Technol Ctr, San Jose, Costa Rica

Source:

2021 IEEE INTERNATIONAL CONFERENCE ON CLUSTER COMPUTING (CLUSTER 2021) | 2021

Keywords:

deep learning; resilience; checkpoint; neural networks; high-performance computing; HDF5; fault injection; FAULT-TOLERANCE; NEURAL-NETWORKS;

DOI:

10.1109/Cluster48925.2021.00045

Chinese Library Classification:

TP3 [Computing technology, computer technology];

Discipline classification code:

0812 ;

Abstract:

The convergence of artificial intelligence, high-performance computing (HPC), and data science brings unique opportunities to markedly advance discoveries that leverage synergies across scientific domains. Recently, deep learning (DL) models have been successfully applied to a wide spectrum of fields, from social network analysis to climate modeling. Such advances greatly benefit from already available HPC infrastructure, mainly GPU-enabled supercomputers. However, those powerful computing systems are exposed to failures, particularly silent data corruption (SDC), in which bit-flips occur without the program crashing. Consequently, exploring the impact of SDCs in DL models is vital for maintaining progress in many scientific domains. This paper uses a distinctive methodology to inject faults into the training phases of DL models. We use checkpoint file alteration to study the effect of bit-flips in different places of a model and at different moments of the training. Our strategy is general enough to allow the analysis of any combination of DL model and framework, so long as they produce a Hierarchical Data Format 5 (HDF5) checkpoint file. The experimental results confirm that popular DL models are often able to absorb dozens of bit-flips with a minimal impact on accuracy convergence.


Pages: 492-503

Page count: 12

