Understanding Soft Error Sensitivity of Deep Learning Models and Frameworks through Checkpoint Alteration

Cited by: 4

Authors:

Rojas, Elvis [1,2]

Perez, Diego [1]

Calhoun, Jon C. [3]

Gomez, Leonardo Bautista [4]

Jones, Terry [5]

Meneses, Esteban [1,6]

Affiliations:

[1] Costa Rica Inst Technol, Cartago, Costa Rica

[2] Natl Univ Costa Rica, Heredia, Costa Rica

[3] Clemson Univ, Clemson, SC 29631 USA

[4] Barcelona Supercomp Ctr, Barcelona, Spain

[5] Oak Ridge Natl Lab, Oak Ridge, TN USA

[6] Costa Rica Natl High Technol Ctr, San Jose, Costa Rica

Source:

2021 IEEE INTERNATIONAL CONFERENCE ON CLUSTER COMPUTING (CLUSTER 2021) | 2021

Keywords:

deep learning; resilience; checkpoint; neural networks; high-performance computing; HDF5; fault injection; fault tolerance

DOI:

10.1109/Cluster48925.2021.00045

Chinese Library Classification (CLC) number:

TP3 [Computing technology, computer technology]

Discipline classification code:

0812

Abstract:

The convergence of artificial intelligence, high-performance computing (HPC), and data science brings unique opportunities for marked advances and discoveries that leverage synergies across scientific domains. Recently, deep learning (DL) models have been successfully applied to a wide spectrum of fields, from social network analysis to climate modeling. Such advances greatly benefit from already available HPC infrastructure, mainly GPU-enabled supercomputers. However, those powerful computing systems are exposed to failures, particularly silent data corruption (SDC), in which bit-flips occur without the program crashing. Consequently, exploring the impact of SDCs in DL models is vital for maintaining progress in many scientific domains. This paper uses a distinctive methodology to inject faults into the training phases of DL models. We use checkpoint file alteration to study the effect of bit-flips in different places of a model and at different moments of training. Our strategy is general enough to allow the analysis of any combination of DL model and framework, so long as they produce a Hierarchical Data Format 5 (HDF5) checkpoint file. The experimental results confirm that popular DL models are often able to absorb dozens of bit-flips with a minimal impact on accuracy convergence.
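The bit-flip idea in the abstract can be illustrated with a minimal sketch. This is not the authors' tool: the function names `flip_bit_float32` and `inject_bit_flips` are hypothetical, weights are assumed to be stored as IEEE 754 float32 values, and the HDF5 read/write step is left out (a plain Python list stands in for a checkpoint dataset).

```python
import random
import struct


def flip_bit_float32(value: float, bit: int) -> float:
    """Return `value`, encoded as IEEE 754 float32, with one bit flipped.

    Bit 0 is the least significant mantissa bit; bit 31 is the sign bit.
    (Hypothetical helper for illustration only.)
    """
    if not 0 <= bit <= 31:
        raise ValueError("bit must be in [0, 31]")
    # Reinterpret the float32 bytes as an unsigned 32-bit integer.
    (raw,) = struct.unpack("<I", struct.pack("<f", value))
    # XOR with a one-bit mask flips exactly that bit.
    (flipped,) = struct.unpack("<f", struct.pack("<I", raw ^ (1 << bit)))
    return flipped


def inject_bit_flips(weights, n_flips, rng):
    """Return a copy of `weights` with `n_flips` random single-bit flips.

    `weights` stands in for one dataset of a checkpoint file; the random
    choice of element and bit position is an assumption of this sketch.
    """
    corrupted = list(weights)
    for _ in range(n_flips):
        idx = rng.randrange(len(corrupted))  # which weight to corrupt
        bit = rng.randrange(32)              # which bit of that weight
        corrupted[idx] = flip_bit_float32(corrupted[idx], bit)
    return corrupted


if __name__ == "__main__":
    rng = random.Random(0)
    weights = [0.5, -0.25, 1.0, 0.125]
    print(inject_bit_flips(weights, n_flips=2, rng=rng))
```

The position of the flipped bit matters: a flip in the low mantissa bits barely changes a weight, while a flip in the exponent can turn 1.0 into infinity, which is one reason a study of this kind would vary where the flips land.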


Pages: 492-503

Number of pages: 12
