Understanding Soft Error Sensitivity of Deep Learning Models and Frameworks through Checkpoint Alteration

Cited by: 4
Authors
Rojas, Elvis [1 ,2 ]
Perez, Diego [1 ]
Calhoun, Jon C. [3 ]
Gomez, Leonardo Bautista [4 ]
Jones, Terry [5 ]
Meneses, Esteban [1 ,6 ]
Affiliations
[1] Costa Rica Inst Technol, Cartago, Costa Rica
[2] Natl Univ Costa Rica, Heredia, Costa Rica
[3] Clemson Univ, Clemson, SC 29631 USA
[4] Barcelona Supercomp Ctr, Barcelona, Spain
[5] Oak Ridge Natl Lab, Oak Ridge, TN USA
[6] Costa Rica Natl High Technol Ctr, San Jose, Costa Rica
Source
2021 IEEE INTERNATIONAL CONFERENCE ON CLUSTER COMPUTING (CLUSTER 2021) | 2021
Keywords
deep learning; resilience; checkpoint; neural networks; high-performance computing; HDF5; fault injection; fault tolerance
DOI
10.1109/Cluster48925.2021.00045
Chinese Library Classification
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
The convergence of artificial intelligence, high-performance computing (HPC), and data science brings unique opportunities for marked advances and discoveries that leverage synergies across scientific domains. Recently, deep learning (DL) models have been successfully applied to a wide spectrum of fields, from social network analysis to climate modeling. Such advances benefit greatly from already available HPC infrastructure, mainly GPU-enabled supercomputers. However, those powerful computing systems are exposed to failures, particularly silent data corruption (SDC), in which bit-flips occur without the program crashing. Consequently, exploring the impact of SDCs on DL models is vital for maintaining progress in many scientific domains. This paper uses a distinctive methodology to inject faults into the training phases of DL models. We use checkpoint file alteration to study the effect of bit-flips in different places of a model and at different moments of the training. Our strategy is general enough to allow the analysis of any combination of DL model and framework, so long as they produce a Hierarchical Data Format 5 (HDF5) checkpoint file. The experimental results confirm that popular DL models are often able to absorb dozens of bit-flips with minimal impact on accuracy convergence.
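To make the idea of a bit-flip in a stored weight concrete, the following is a minimal sketch and not the authors' tool. It assumes weights are IEEE 754 float32 values and uses an in-memory list as a stand-in for the contents of a checkpoint dataset; the function names `flip_bit_float32` and `inject_bit_flips` are hypothetical. Applying the same operation to a real HDF5 checkpoint would additionally require an HDF5 library to read and write the datasets, which is omitted here.

```python
import random
import struct


def flip_bit_float32(value: float, bit: int) -> float:
    """Return `value`, stored as a float32, with one bit flipped (bit 0 = LSB)."""
    if not 0 <= bit < 32:
        raise ValueError("bit must be in [0, 32)")
    # Reinterpret the float32 bytes as an unsigned 32-bit integer.
    (as_int,) = struct.unpack("<I", struct.pack("<f", value))
    # XOR with a one-hot mask flips exactly one bit.
    (flipped,) = struct.unpack("<f", struct.pack("<I", as_int ^ (1 << bit)))
    return flipped


def inject_bit_flips(weights, n_flips, rng):
    """Flip `n_flips` randomly chosen bits in a copy of a flat list of weights."""
    corrupted = list(weights)  # leave the original weights untouched
    for _ in range(n_flips):
        idx = rng.randrange(len(corrupted))  # which weight to corrupt
        bit = rng.randrange(32)              # which bit of that weight
        corrupted[idx] = flip_bit_float32(corrupted[idx], bit)
    return corrupted


if __name__ == "__main__":
    print(flip_bit_float32(0.5, 31))  # sign bit flipped: -0.5
    print(flip_bit_float32(0.5, 30))  # high exponent bit flipped: a huge value
    weights = [0.5, -1.25, 2.0, 3.0]
    print(inject_bit_flips(weights, 1, random.Random(0)))
```

The position of the flipped bit matters: a flip in the low mantissa bits perturbs a weight only slightly, while a flip in the exponent can change it by many orders of magnitude, which is one reason a study of this kind varies where and when the flips are injected.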
Pages: 492-503
Number of pages: 12