Understanding Soft Error Sensitivity of Deep Learning Models and Frameworks through Checkpoint Alteration

Cited by: 4
Authors
Rojas, Elvis [1, 2]
Perez, Diego [1]
Calhoun, Jon C. [3]
Gomez, Leonardo Bautista [4]
Jones, Terry [5]
Meneses, Esteban [1, 6]
Affiliations
[1] Costa Rica Inst Technol, Cartago, Costa Rica
[2] Natl Univ Costa Rica, Heredia, Costa Rica
[3] Clemson Univ, Clemson, SC 29631 USA
[4] Barcelona Supercomp Ctr, Barcelona, Spain
[5] Oak Ridge Natl Lab, Oak Ridge, TN USA
[6] Costa Rica Natl High Technol Ctr, San Jose, Costa Rica
Source
2021 IEEE International Conference on Cluster Computing (CLUSTER 2021) | 2021
Keywords
deep learning; resilience; checkpoint; neural networks; high-performance computing; HDF5; fault injection; fault tolerance
DOI
10.1109/Cluster48925.2021.00045
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology]
Subject classification code
0812
Abstract
The convergence of artificial intelligence, high-performance computing (HPC), and data science brings unique opportunities for markedly advanced discoveries that leverage synergies across scientific domains. Recently, deep learning (DL) models have been successfully applied to a wide spectrum of fields, from social network analysis to climate modeling. Such advances greatly benefit from already available HPC infrastructure, mainly GPU-enabled supercomputers. However, those powerful computing systems are exposed to failures, particularly silent data corruption (SDC), in which bit-flips occur without the program crashing. Consequently, exploring the impact of SDCs in DL models is vital for maintaining progress in many scientific domains. This paper uses a distinctive methodology to inject faults into the training phase of DL models. We use checkpoint file alteration to study the effect of bit-flips in different places of a model and at different moments during training. Our strategy is general enough to allow the analysis of any combination of DL model and framework, so long as they produce a Hierarchical Data Format 5 (HDF5) checkpoint file. The experimental results confirm that popular DL models are often able to absorb dozens of bit-flips with minimal impact on accuracy convergence.
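As a rough illustration of what a single bit-flip does to a stored weight, the sketch below flips one bit in a float32 NumPy array. It is not the authors' injection tool: the function name `flip_bit` and the in-memory array are assumptions made for this example, with the array standing in for a dataset that would be read from an HDF5 checkpoint.

```python
import numpy as np


def flip_bit(weights: np.ndarray, index: int, bit: int) -> np.ndarray:
    """Return a copy of a float32 array with one bit flipped in one element.

    Hypothetical helper for illustration only. `index` is a flat element
    index; `bit` counts from 0 (least significant mantissa bit) to 31 (sign).
    """
    if weights.dtype != np.float32:
        raise TypeError("this sketch only handles float32 arrays")
    if not 0 <= bit < 32:
        raise ValueError("bit must be in the range [0, 32)")
    corrupted = weights.copy()
    # Reinterpret the same bytes as unsigned integers; no value conversion.
    raw = corrupted.view(np.uint32)
    raw.flat[index] ^= np.uint32(1 << bit)
    return corrupted


# Stand-in for a weight tensor loaded from a checkpoint (assumed values).
w = np.array([1.0, -2.5, 0.5], dtype=np.float32)

sign_flip = flip_bit(w, 0, 31)      # sign bit: 1.0 becomes -1.0
exponent_flip = flip_bit(w, 0, 30)  # top exponent bit: 1.0 becomes inf
mantissa_flip = flip_bit(w, 0, 0)   # lowest mantissa bit: tiny change
```

The three flips show why the position of a bit-flip matters: a low mantissa bit changes the value only marginally, while a high exponent bit can turn an ordinary weight into infinity.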
Pages: 492-503
Number of pages: 12