PartialRC: A Partial Recomputing Method for Efficient Fault Recovery on GPGPUs

Authors
Xin-Hai Xu
Xue-Jun Yang
Jing-Ling Xue
Yu-Fei Lin
Yi-Song Lin
Affiliations
[1] National University of Defense Technology, National Laboratory for Parallel and Distributed Processing, School of Computer
[2] School of Computer Science and Engineering, University of New South Wales, Programming Languages and Compilers Group
Keywords
GPGPU; partial recomputing; fault tolerance; CUDA; checkpointing
DOI
Not available
Abstract
GPGPUs are increasingly being used as performance accelerators for HPC (High Performance Computing) applications in CPU/GPU heterogeneous computing systems, including TianHe-1A, the world's fastest supercomputer in the TOP500 list, built at NUDT (National University of Defense Technology) last year. However, despite their performance advantages, GPGPUs do not provide built-in fault-tolerant mechanisms to offer the reliability guarantees required by many HPC applications. By analyzing the SIMT (single-instruction, multiple-thread) characteristics of programs running on GPGPUs, we have developed PartialRC, a new checkpoint-based, compiler-directed partial recomputing method that achieves efficient fault recovery by leveraging the phenomenal computing power of GPGPUs. In this paper, we introduce our PartialRC method, which recovers from errors detected in a code region by partially recomputing the region, describe a checkpoint-based fault-tolerance framework built on PartialRC, and discuss an implementation on the CUDA platform. Validation using a range of representative CUDA programs on NVIDIA GPGPUs against FullRC (a traditional full-recomputing Checkpoint-Rollback-Restart fault recovery method for CPUs) shows that PartialRC significantly reduces the fault recovery overheads incurred by FullRC, on average by 73.5% when errors occur early in execution and by 74.6% when they occur later. In addition, PartialRC reduces the error detection overheads incurred by FullRC during fault recovery while incurring negligible performance overhead when no fault occurs.
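The contrast the abstract draws between full and partial recomputing can be illustrated with a small CPU-side simulation. This is a minimal sketch under stated assumptions, not the authors' CUDA implementation: it assumes the code region splits into independent blocks of work, that errors are found by comparing against a redundant execution, and that the checkpoint is simply the saved block inputs. All names (`block_kernel`, `run_region`, `detect_errors`, `recover_full`, `recover_partial`) are hypothetical.

```python
from typing import Callable, List, Set, Tuple

Block = List[int]


def block_kernel(block_input: Block) -> int:
    """Stand-in for the work done by one independent block (assumption)."""
    return sum(x * x for x in block_input)


def run_region(blocks: List[Block], kernel: Callable[[Block], int],
               faulty: Set[int]) -> List[int]:
    """Run the region once; blocks listed in `faulty` get an injected error."""
    results = []
    for i, block in enumerate(blocks):
        value = kernel(block)
        results.append(value + 1 if i in faulty else value)
    return results


def detect_errors(blocks: List[Block], kernel: Callable[[Block], int],
                  results: List[int]) -> List[int]:
    """Hypothetical detector: flag blocks whose redundant run disagrees."""
    return [i for i, block in enumerate(blocks) if kernel(block) != results[i]]


def recover_full(blocks: List[Block],
                 kernel: Callable[[Block], int]) -> Tuple[List[int], int]:
    """FullRC-style recovery: roll back and recompute every block."""
    return [kernel(block) for block in blocks], len(blocks)


def recover_partial(blocks: List[Block], kernel: Callable[[Block], int],
                    results: List[int], bad: List[int]) -> Tuple[List[int], int]:
    """Partial recovery: recompute only the blocks flagged as erroneous."""
    fixed = list(results)
    for i in bad:
        fixed[i] = kernel(blocks[i])
    return fixed, len(bad)


# Checkpoint: the saved inputs of the region (eight blocks of three values).
checkpoint = [[i, i + 1, i + 2] for i in range(8)]
results = run_region(checkpoint, block_kernel, faulty={2, 5})
bad = detect_errors(checkpoint, block_kernel, results)

full_results, full_cost = recover_full(checkpoint, block_kernel)
partial_results, partial_cost = recover_partial(checkpoint, block_kernel, results, bad)
```

In this toy setting both recovery paths yield the same results, but the partial path recomputes two blocks instead of eight. The overhead reductions reported in the abstract come from the authors' measurements on real CUDA programs, not from this sketch.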
Pages: 240 - 255
Page count: 15
Related papers
50 in total
  • [1] PartialRC: A Partial Recomputing Method for Efficient Fault Recovery on GPGPUs
    Xu, Xin-Hai
    Yang, Xue-Jun
    Xue, Jing-Ling
    Lin, Yu-Fei
    Lin, Yi-Song
    JOURNAL OF COMPUTER SCIENCE AND TECHNOLOGY, 2012, 27 (02) : 240 - 255
  • [2] PartialRC: A Partial Recomputing Method for Efficient Fault Recovery on GPGPUs
    徐新海
    杨学军
    薛京灵
    林宇斐
    林一松
    Journal of Computer Science & Technology, 2012, 27 (02) : 240 - 255
  • [3] Detecting SDCs in GPGPUs Through Efficient Partial Thread Redundancy
    Wei, Xiaohui
    Wu, Yan
    Jiang, Nan
    Yue, Hengshan
    ALGORITHMS AND ARCHITECTURES FOR PARALLEL PROCESSING, ICA3PP 2023, PT VII, 2024, 14493 : 224 - 239
  • [4] Temporal Memoization for Energy-Efficient Timing Error Recovery in GPGPUs
    Rahimi, Abbas
    Benini, Luca
    Gupta, Rajesh K.
    2014 DESIGN, AUTOMATION AND TEST IN EUROPE CONFERENCE AND EXHIBITION (DATE), 2014,
  • [5] Efficient KEMs with partial message recovery
    Bjorstad, Tor E.
    Dent, Alex W.
    Smart, Nigel P.
    CRYPTOGRAPHY AND CODING, PROCEEDINGS, 2007, 4887 : 233 - +
  • [6] Efficient Testing of Recovery Code Using Fault Injection
    Marinescu, Paul D.
    Candea, George
    ACM TRANSACTIONS ON COMPUTER SYSTEMS, 2011, 29 (04):
  • [7] An efficient method for bearing fault diagnosis
    Geetha, G.
    Geethanjali, P.
    Systems Science and Control Engineering, 2024, 12 (01):
  • [8] An efficient method for bearing fault diagnosis
    Geetha, G.
    Geethanjali, P.
    SYSTEMS SCIENCE & CONTROL ENGINEERING, 2024, 12 (01)
  • [9] A Fault Tolerance Method for Control Systems with Full or Partial Fault Decoupling
    Zhirabok, A. N.
    Filaretov, V. F.
    Zuev, A. V.
    Shumsky, A. E.
    AUTOMATION AND REMOTE CONTROL, 2024, 85 (07) : 584 - 596
  • [10] Partial cholecystectomy is a safe and efficient method
    Cakmak, A.
    Genc, V.
    Orozakunov, E.
    Kepenekci, I.
    Cetinkaya, Oe A.
    Hazinedaroglu, M. S.
    CHIRURGIA, 2009, 104 (06) : 701 - 704