A checkpoint/restart scheme for CUDA programs with complex computation states

Authors
Jiang H. [1]
Zhang Y. [1]
Jenness J. [1]
Li K.-C. [2]
Affiliations
[1] Department of Computer Science, Arkansas State University
[2] Department of Computer Science and Information Engineering, Providence University
Keywords
Checkpoint/restart; CUDA; Fault tolerance; GPU
DOI
10.2991/ijndc.2013.1.4.2
Abstract
Checkpoint/restart has been an effective mechanism to achieve fault tolerance for many long-running scientific applications. The common approach is to save computation states in memory and secondary storage for execution resumption. However, as the GPU plays a much bigger role in high performance computing, there is no effective checkpoint/restart scheme yet due to the difficulty of the GPU computation state handling. This paper proposes an application-level checkpoint/restart scheme to save and restore GPU computation states in annotated user programs. A pre-compiler and run-time support module are developed to construct and save states in CPU system memory dynamically, whereas secondary storage can be utilized for scalability and long-term fault tolerance. CUDA programs with complicated computation states are supported. State-related variables dissipated in various memory units are collected. Both stack and heap are duplicated at application level for state construction. Experimental results have demonstrated the effectiveness of the proposed scheme. © 2013, Atlantis Press. All rights reserved.
Pages: 196 - 212
Page count: 16
Related papers
36 in total
  • [1] A Checkpoint/Restart Scheme for CUDA Programs with Complex Computation States
    Jiang, Hai
    Zhang, Yulu
    Jenness, Jeff
    Li, Kuan-Ching
    INTERNATIONAL JOURNAL OF NETWORKED AND DISTRIBUTED COMPUTING, 2013, 1 (04) : 196 - 212
  • [2] A Checkpoint/Restart Scheme for CUDA Applications with Complex Memory Hierarchy
    Zhang, Yulu
    Guo, Xinyuan
    Jiang, Hai
    Li, Kuan-Ching
    2013 14TH ACIS INTERNATIONAL CONFERENCE ON SOFTWARE ENGINEERING, ARTIFICIAL INTELLIGENCE, NETWORKING AND PARALLEL/DISTRIBUTED COMPUTING (SNPD 2013), 2013, : 247 - 252
  • [3] Efficient checkpoint/Restart of CUDA applications
    Nukada, Akira
    Suzuki, Taichiro
    Matsuoka, Satoshi
    PARALLEL COMPUTING, 2023, 116
  • [4] CheCUDA: A Checkpoint/Restart Tool for CUDA Applications
    Takizawa, Hiroyuki
    Sato, Katsuto
    Komatsu, Kazuhiko
    Kobayashi, Hiroaki
    2009 INTERNATIONAL CONFERENCE ON PARALLEL AND DISTRIBUTED COMPUTING, APPLICATIONS AND TECHNOLOGIES (PDCAT 2009), 2009, : 408 - +
  • [5] cudaCR: An In-kernel Application-level Checkpoint/Restart Scheme for CUDA-enabled GPUs
    Pourghassemi, Behnam
    Chandramowlishwaran, Aparna
    2017 IEEE INTERNATIONAL CONFERENCE ON CLUSTER COMPUTING (CLUSTER), 2017, : 725 - 732
  • [6] CRAC: Checkpoint-Restart Architecture for CUDA with Streams and UVM
    Jain, Twinkle
    Cooperman, Gene
    PROCEEDINGS OF SC20: THE INTERNATIONAL CONFERENCE FOR HIGH PERFORMANCE COMPUTING, NETWORKING, STORAGE AND ANALYSIS (SC20), 2020,
  • [7] CRUM: Checkpoint-Restart Support for CUDA's Unified Memory
    Garg, Rohan
    Mohan, Apoorve
    Sullivan, Michael
    Cooperman, Gene
    2018 IEEE INTERNATIONAL CONFERENCE ON CLUSTER COMPUTING (CLUSTER), 2018, : 302 - 313
  • [8] Cricket: A virtualization layer for distributed execution of CUDA applications with checkpoint/restart support
    Eiling, Niklas
    Baude, Jonas
    Lankes, Stefan
    Monti, Antonello
    CONCURRENCY AND COMPUTATION-PRACTICE & EXPERIENCE, 2022, 34 (14):
  • [9] Application-transparent checkpoint/restart for MPI programs over InfiniBand
    Gao, Qi
    Yu, Weikuan
    Huang, Wei
    Panda, Dhabaleswar K.
    2006 INTERNATIONAL CONFERENCE ON PARALLEL PROCESSING, PROCEEDINGS, 2006, : 471 - 478
  • [10] Mutable Checkpoint-Restart: Automating Live Update for Generic Server Programs
    Giuffrida, Cristiano
    Iorgulescu, Calin
    Tanenbaum, Andrew S.
    ACM/IFIP/USENIX MIDDLEWARE 2014, 2014, : 133 - 144