Efficient checkpoint/restart of CUDA applications

Cited by: 0
Authors
Nukada, Akira [1]
Suzuki, Taichiro [3]
Matsuoka, Satoshi [2, 3]
Affiliations
[1] Univ Tsukuba, CCS, 1-1-1 Tennodai, Tsukuba, Ibaraki 3058577, Japan
[2] RIKEN R-CCS, 7-1-26 Minatojima Minami Machi, Chuo Ku, Kobe, Hyogo 6500047, Japan
[3] Tokyo Inst Technol, 2-12-1 Ookayama, Meguro Ku, Tokyo 1528550, Japan
Keywords
Checkpoint and restart; NVIDIA CUDA; GPU;
DOI
10.1016/j.parco.2023.103018
Chinese Library Classification (CLC) number
TP301 [Theory, Methods];
Discipline classification code
081202;
Abstract
We present NVCR, which enables transparent checkpoint and restart of CUDA applications. NVCR works as an extension of major system-level checkpoint software such as BLCR and DMTCP. It employs a proxy process: the application accesses GPU devices via the proxy process, which improves compatibility with the latest CUDA runtime software. To reduce the overhead of inter-process communication, NVCR efficiently uses SYSV IPC shared memory as CUDA pinned memory. Performance evaluations using micro-benchmarks and Amber as a real application show that NVCR's overhead is acceptably low.
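The proxy-process idea in the abstract can be illustrated with a minimal C sketch. This is not NVCR's code: the function names `roundtrip_via_proxy` and `proxy_work` are hypothetical, and the "GPU work" is replaced by a CPU loop that doubles each element. The sketch only shows how a System V shared memory segment lets an application and a forked proxy exchange a buffer without copying it through a pipe or socket. In a real CUDA setting, the proxy could plausibly pin the attached segment with `cudaHostRegister`, but that step is an assumption here and is left as a comment.

```c
#include <stddef.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for the GPU work a proxy would perform:
 * doubles each element in place. */
static void proxy_work(float *buf, size_t n) {
    for (size_t i = 0; i < n; i++) {
        buf[i] *= 2.0f;
    }
}

/* Sends n floats to a forked "proxy" through a SysV shared memory
 * segment and copies the processed result into out.
 * Returns 0 on success, -1 on failure. */
int roundtrip_via_proxy(const float *in, float *out, size_t n) {
    size_t bytes = n * sizeof(float);

    /* Create a private segment; both processes see the same pages. */
    int shmid = shmget(IPC_PRIVATE, bytes, IPC_CREAT | 0600);
    if (shmid < 0) {
        return -1;
    }
    float *shared = shmat(shmid, NULL, 0);
    if (shared == (void *)-1) {
        shmctl(shmid, IPC_RMID, NULL);
        return -1;
    }

    /* The application writes its data directly into the segment. */
    memcpy(shared, in, bytes);

    pid_t pid = fork();
    if (pid < 0) {
        shmdt(shared);
        shmctl(shmid, IPC_RMID, NULL);
        return -1;
    }
    if (pid == 0) {
        /* Proxy side. Assumption: a real proxy might pin the segment
         * here, e.g. with cudaHostRegister(shared, bytes, ...), and
         * then issue CUDA calls on the application's behalf. */
        proxy_work(shared, n);
        shmdt(shared);
        _exit(0);
    }

    /* Application side: wait for the proxy, then read the result. */
    int status = 0;
    int ok = waitpid(pid, &status, 0) == pid &&
             WIFEXITED(status) && WEXITSTATUS(status) == 0;
    if (ok) {
        memcpy(out, shared, bytes);
    }

    shmdt(shared);
    shmctl(shmid, IPC_RMID, NULL);
    return ok ? 0 : -1;
}
```

Calling `roundtrip_via_proxy(in, out, 4)` with `in = {1, 2, 3, 4}` fills `out` with the doubled values. Because only the proxy would hold GPU state in such a design, the application process itself stays free of driver state that a system-level checkpointer cannot capture.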
Pages: 9
Related papers
50 in total
  • [1] CheCUDA: A Checkpoint/Restart Tool for CUDA Applications
    Takizawa, Hiroyuki
    Sato, Katsuto
    Komatsu, Kazuhiko
    Kobayashi, Hiroaki
    2009 INTERNATIONAL CONFERENCE ON PARALLEL AND DISTRIBUTED COMPUTING, APPLICATIONS AND TECHNOLOGIES (PDCAT 2009), 2009, : 408 - +
  • [2] A Checkpoint/Restart Scheme for CUDA Applications with Complex Memory Hierarchy
    Zhang, Yulu
    Guo, Xinyuan
    Jiang, Hai
    Li, Kuan-Ching
    2013 14TH ACIS INTERNATIONAL CONFERENCE ON SOFTWARE ENGINEERING, ARTIFICIAL INTELLIGENCE, NETWORKING AND PARALLEL/DISTRIBUTED COMPUTING (SNPD 2013), 2013, : 247 - 252
  • [3] Cricket: A virtualization layer for distributed execution of CUDA applications with checkpoint/restart support
    Eiling, Niklas
    Baude, Jonas
    Lankes, Stefan
    Monti, Antonello
    CONCURRENCY AND COMPUTATION-PRACTICE & EXPERIENCE, 2022, 34 (14):
  • [4] A Checkpoint/Restart Scheme for CUDA Programs with Complex Computation States
    Jiang, Hai
    Zhang, Yulu
    Jenness, Jeff
    Li, Kuan-Ching
    INTERNATIONAL JOURNAL OF NETWORKED AND DISTRIBUTED COMPUTING, 2013, 1 (04) : 196 - 212
  • [5] A checkpoint/restart scheme for CUDA programs with complex computation states
    Jiang H.
    Zhang Y.
    Jenness J.
    Li K.-C.
    International Journal of Networked and Distributed Computing, 2013, 1 (4) : 196 - 212
  • [6] CRAC: Checkpoint-Restart Architecture for CUDA with Streams and UVM
    Jain, Twinkle
    Cooperman, Gene
    PROCEEDINGS OF SC20: THE INTERNATIONAL CONFERENCE FOR HIGH PERFORMANCE COMPUTING, NETWORKING, STORAGE AND ANALYSIS (SC20), 2020,
  • [7] CRUM: Checkpoint-Restart Support for CUDA's Unified Memory
    Garg, Rohan
    Mohan, Apoorve
    Sullivan, Michael
    Cooperman, Gene
    2018 IEEE INTERNATIONAL CONFERENCE ON CLUSTER COMPUTING (CLUSTER), 2018, : 302 - 313
  • [8] Checkpoint Restart Support for Heterogeneous HPC Applications
    Parasyris, Konstantinos
    Keller, Kai
    Bautista-Gomez, Leonardo
    Unsal, Osman
    2020 20TH IEEE/ACM INTERNATIONAL SYMPOSIUM ON CLUSTER, CLOUD AND INTERNET COMPUTING (CCGRID 2020), 2020, : 242 - 251
  • [9] Efficient Encoding and Reconstruction of HPC Datasets for Checkpoint/Restart
    Zhang, Jialing
    Zhuo, Xiaoyan
    Moon, Aekyeung
    Liu, Hang
    Son, Seung Woo
    2019 35TH SYMPOSIUM ON MASS STORAGE SYSTEMS AND TECHNOLOGIES (MSST 2019), 2019, : 79 - 91
  • [10] cudaCR: An In-kernel Application-level Checkpoint/Restart Scheme for CUDA-enabled GPUs
    Pourghassemi, Behnam
    Chandramowlishwaran, Aparna
    2017 IEEE INTERNATIONAL CONFERENCE ON CLUSTER COMPUTING (CLUSTER), 2017, : 725 - 732