Understanding Soft Error Sensitivity of Deep Learning Models and Frameworks through Checkpoint Alteration

Cited by: 4

Authors:

Rojas, Elvis ^{[1,2]}

Perez, Diego ^{[1]}

Calhoun, Jon C. ^{[3]}

Gomez, Leonardo Bautista ^{[4]}

Jones, Terry ^{[5]}

Meneses, Esteban ^{[1,6]}

Affiliations:

[1] Costa Rica Inst Technol, Cartago, Costa Rica

[2] Natl Univ Costa Rica, Heredia, Costa Rica

[3] Clemson Univ, Clemson, SC 29631 USA

[4] Barcelona Supercomp Ctr, Barcelona, Spain

[5] Oak Ridge Natl Lab, Oak Ridge, TN USA

[6] Costa Rica Natl High Technol Ctr, San Jose, Costa Rica

Source:

2021 IEEE INTERNATIONAL CONFERENCE ON CLUSTER COMPUTING (CLUSTER 2021) | 2021

Keywords:

deep learning; resilience; checkpoint; neural networks; high-performance computing; HDF5; fault injection; fault tolerance

DOI:

10.1109/Cluster48925.2021.00045

Chinese Library Classification (CLC) number:

TP3 [Computing technology, computer technology]

Discipline classification code:

0812

Abstract:

The convergence of artificial intelligence, high-performance computing (HPC), and data science brings unique opportunities for markedly advancing discoveries that leverage synergies across scientific domains. Recently, deep learning (DL) models have been successfully applied to a wide spectrum of fields, from social network analysis to climate modeling. Such advances greatly benefit from already available HPC infrastructure, mainly GPU-enabled supercomputers. However, those powerful computing systems are exposed to failures, particularly silent data corruption (SDC), in which bit-flips occur without the program crashing. Consequently, exploring the impact of SDCs on DL models is vital for maintaining progress in many scientific domains. This paper uses a distinctive methodology to inject faults into the training phases of DL models. We use checkpoint file alteration to study the effect of bit-flips in different places of a model and at different moments of the training. Our strategy is general enough to allow the analysis of any combination of DL model and framework, so long as they produce a Hierarchical Data Format 5 (HDF5) checkpoint file. The experimental results confirm that popular DL models are often able to absorb dozens of bit-flips with a minimal impact on accuracy convergence.
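
The bit-flip injection described in the abstract can be illustrated with a minimal sketch. This is not the authors' tool: it assumes the weights are IEEE 754 32-bit floats and, to stay self-contained, operates on a plain Python list rather than on datasets inside an HDF5 checkpoint file. The names `flip_bit` and `inject_bit_flips` are hypothetical.

```python
import random
import struct


def flip_bit(value, bit):
    """Return `value` with one bit of its float32 encoding flipped.

    Bit 0 is the least significant mantissa bit; bit 31 is the sign bit.
    Assumption: the stored weights are IEEE 754 single-precision floats.
    """
    if not 0 <= bit < 32:
        raise ValueError("bit must be in [0, 31] for float32")
    # Reinterpret the float32 bytes as an unsigned 32-bit integer.
    (raw,) = struct.unpack("<I", struct.pack("<f", value))
    # XOR with a one-hot mask flips exactly the selected bit.
    (flipped,) = struct.unpack("<f", struct.pack("<I", raw ^ (1 << bit)))
    return flipped


def inject_bit_flips(weights, n_flips, rng):
    """Flip `n_flips` randomly chosen bits in `weights`, in place.

    Returns a log of (index, bit) pairs so an experiment can be replayed.
    """
    log = []
    for _ in range(n_flips):
        index = rng.randrange(len(weights))
        bit = rng.randrange(32)
        weights[index] = flip_bit(weights[index], bit)
        log.append((index, bit))
    return log


if __name__ == "__main__":
    # Stand-in for one layer's weights read from a checkpoint.
    layer = [0.5] * 8
    print(inject_bit_flips(layer, 3, random.Random(0)))
    print(layer)
```

The position of the flipped bit matters a great deal: flipping a low mantissa bit barely changes a weight, while flipping a high exponent bit can turn 1.0 into infinity. This may be one reason to study where in a model a fault lands as well as when it occurs.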


Pages: 492-503

Page count: 12
