Understanding Soft Error Sensitivity of Deep Learning Models and Frameworks through Checkpoint Alteration

Cited by: 4
Authors
Rojas, Elvis [1, 2]
Perez, Diego [1]
Calhoun, Jon C. [3]
Gomez, Leonardo Bautista [4]
Jones, Terry [5]
Meneses, Esteban [1, 6]
Affiliations
[1] Costa Rica Inst Technol, Cartago, Costa Rica
[2] Natl Univ Costa Rica, Heredia, Costa Rica
[3] Clemson Univ, Clemson, SC 29631 USA
[4] Barcelona Supercomp Ctr, Barcelona, Spain
[5] Oak Ridge Natl Lab, Oak Ridge, TN USA
[6] Costa Rica Natl High Technol Ctr, San Jose, Costa Rica
Source
2021 IEEE INTERNATIONAL CONFERENCE ON CLUSTER COMPUTING (CLUSTER 2021) | 2021
Keywords
deep learning; resilience; checkpoint; neural networks; high-performance computing; HDF5; fault injection; fault tolerance
DOI
10.1109/Cluster48925.2021.00045
Chinese Library Classification number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
The convergence of artificial intelligence, high-performance computing (HPC), and data science brings unique opportunities for markedly advancing discoveries and leveraging synergies across scientific domains. Recently, deep learning (DL) models have been successfully applied to a wide spectrum of fields, from social network analysis to climate modeling. Such advances greatly benefit from already available HPC infrastructure, mainly GPU-enabled supercomputers. However, those powerful computing systems are exposed to failures, particularly silent data corruption (SDC), in which bit-flips occur without the program crashing. Consequently, exploring the impact of SDCs on DL models is vital for maintaining progress in many scientific domains. This paper uses a distinctive methodology to inject faults into the training phases of DL models. We use checkpoint file alteration to study the effect of bit-flips in different places of a model and at different moments of the training. Our strategy is general enough to allow the analysis of any combination of DL model and framework, so long as they produce a Hierarchical Data Format 5 (HDF5) checkpoint file. The experimental results confirm that popular DL models are often able to absorb dozens of bit-flips with minimal impact on accuracy convergence.
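As an illustration of why the location of a bit-flip matters, the following is a minimal sketch that flips a single bit in one IEEE 754 float32 value using only the Python standard library. It is not the authors' tool and does not read or write HDF5 files; the function name `flip_bit` and the sample weight `0.5` are assumptions made for this example.

```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Return `value`, stored as an IEEE 754 float32, with one bit flipped.

    Bit 0 is the least significant mantissa bit, bits 23-30 form the
    exponent, and bit 31 is the sign. (Hypothetical helper for illustration.)
    """
    if not 0 <= bit < 32:
        raise ValueError("bit must be in the range 0..31")
    # Reinterpret the float32 bytes as an unsigned 32-bit integer.
    (raw,) = struct.unpack("<I", struct.pack("<f", value))
    # XOR with a one-bit mask toggles exactly that bit.
    (flipped,) = struct.unpack("<f", struct.pack("<I", raw ^ (1 << bit)))
    return flipped


if __name__ == "__main__":
    weight = 0.5  # hypothetical model weight
    for bit in (0, 22, 23, 30, 31):
        print(f"bit {bit:2d}: {weight} -> {flip_bit(weight, bit)!r}")
```

In this sketch, a flip in a low mantissa bit barely changes the value, while a flip in a high exponent bit changes it by many orders of magnitude, which is one reason the position of a corrupted bit can matter more than the number of corrupted bits.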
Pages: 492-503
Page count: 12
Related papers
50 in total
  • [21] EXPLAINING DEEP MODELS THROUGH FORGETTABLE LEARNING DYNAMICS
    Benkert, Ryan
    Aribido, Oluwaseun Joseph
    AlRegib, Ghassan
    2021 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING (ICIP), 2021, : 3692 - 3696
  • [22] Understanding simple liquids through statistical and deep learning approaches
    Moradzadeh, A.
    Aluru, N. R.
    JOURNAL OF CHEMICAL PHYSICS, 2021, 154 (20):
  • [23] A SENSITIVITY ANALYSIS OF RIVER ENVIRONMENT FACTORS THROUGH DEEP LEARNING
    Zhang, Shengping
    Qi, Jie
    INTERNATIONAL JOURNAL OF GEOMATE, 2022, 23 (97): : 146 - 153
  • [24] Understanding Corporate Governance Through Learning Models of Managerial Competence
    Hermalin, Benjamin E.
    Weisbach, Michael S.
    ASIA-PACIFIC JOURNAL OF FINANCIAL STUDIES, 2019, 48 (01) : 7 - 29
  • [25] Global system understanding of simulation models through machine learning
    Backes, André
    VDI Berichte, 2022, (2407): : 563 - 570
  • [26] Advancing the understanding of sustainable business models through organizational learning
    Ademi, Bejtush
    Saetre, Alf Steinar
    Klungseth, Nora Johanne
    BUSINESS STRATEGY AND THE ENVIRONMENT, 2024, 33 (06) : 5174 - 5194
  • [27] EXPLAINING AI: UNDERSTANDING DEEP LEARNING MODELS FOR HERITAGE POINT CLOUDS
    Matrone, F.
    Felicetti, A.
    Paolanti, M.
    Pierdicca, R.
    29TH CIPA SYMPOSIUM DOCUMENTING, UNDERSTANDING, PRESERVING CULTURAL HERITAGE. HUMANITIES AND DIGITAL TECHNOLOGIES FOR SHAPING THE FUTURE, VOL. 10-M-1, 2023, : 207 - 214
  • [28] Understanding and Mitigating the Soft Error of Contrastive Language-Image Pre-training Models
    Shi, Yihao
    Wang, Bo
    Luo, Shengbai
    Xue, Qingshan
    Zhang, Xueyi
    Ma, Sheng
    8TH INTERNATIONAL TEST CONFERENCE IN ASIA, ITC-ASIA 2024, 2024,
  • [29] Computational frameworks integrating deep learning and statistical models in mining multimodal omics data
    Lac, Leann
    Leung, Carson K.
    Hu, Pingzhao
    JOURNAL OF BIOMEDICAL INFORMATICS, 2024, 152
  • [30] Benchmarking Deep Learning Frameworks with FPGA-suitable Models on a Traffic Sign Dataset
    Lin, Zhongyi
    Ota, Jeffrey M.
    Owens, John D.
    Muyan-Ozcelik, Pinar
    2018 IEEE INTELLIGENT VEHICLES SYMPOSIUM (IV), 2018, : 1197 - 1203