Efficient Checkpoint/Restart of CUDA Applications

Cited by: 0
Authors
Nukada, Akira [1]
Suzuki, Taichiro [3]
Matsuoka, Satoshi [2, 3]
Affiliations
[1] Univ Tsukuba, CCS, 1-1-1 Tennodai, Tsukuba, Ibaraki 3058577, Japan
[2] RIKEN R CCS, 7-1-26 Minatojima Minami Machi, Chuo Ku, Kobe, Hyogo 6500047, Japan
[3] Tokyo Inst Technol, 2-12-1 Oookayama, Meguro Ku, Tokyo 1528550, Japan
Keywords
Checkpoint and restart; NVIDIA CUDA; GPU
DOI
10.1016/j.parco.2023.103018
Chinese Library Classification (CLC) number
TP301 [Theory, Methods];
Discipline classification code
081202;
Abstract
We present NVCR, which enables transparent checkpoint and restart of CUDA applications. NVCR works as an extension of major system-level checkpoint software such as BLCR and DMTCP. It employs a proxy process, and the application accesses GPU devices via this proxy process, which improves compatibility with the latest CUDA runtime software. To reduce the overhead of inter-process communication, NVCR efficiently uses SYSV IPC shared memory as CUDA pinned memory. Performance evaluations using micro-benchmarks and Amber as a real application show that NVCR's overhead is acceptably low.
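The shared-memory idea in the abstract can be illustrated with a minimal C sketch. It is not NVCR's implementation: the function name `proxy_roundtrip`, the buffer layout, and the checksum stand-in for GPU work are all hypothetical. The application process fills a System V shared-memory segment, and a forked "proxy" process reads the same pages in place and writes a result back, so no payload is copied between processes. In a real proxy, the segment would presumably be page-locked for CUDA (for example with `cudaHostRegister`) before device transfers; that step is an assumption here and appears only as a comment so the sketch runs without a GPU.

```c
#include <stddef.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper (not part of NVCR): share n values with a forked
 * "proxy" process through a SysV shared-memory segment and return the
 * checksum the proxy writes back, or -1 on failure. */
long proxy_roundtrip(size_t n)
{
    size_t bytes = (n + 1) * sizeof(long); /* n payload slots + 1 result slot */
    int id = shmget(IPC_PRIVATE, bytes, IPC_CREAT | 0600);
    if (id < 0)
        return -1;

    long *buf = shmat(id, NULL, 0);
    if (buf == (void *)-1) {
        shmctl(id, IPC_RMID, NULL);
        return -1;
    }

    /* Application side: write the payload directly into shared memory. */
    for (size_t i = 0; i < n; i++)
        buf[i] = (long)i;
    buf[n] = -1; /* result slot, filled in by the proxy */

    long result = -1;
    pid_t pid = fork(); /* the child inherits the attached segment */
    if (pid == 0) {
        /* Proxy side: the payload is visible here without any copy.
         * Assumption: a real proxy would pin this region for CUDA
         * (e.g. cudaHostRegister) and issue device transfers; a plain
         * checksum stands in for that work. */
        long sum = 0;
        for (size_t i = 0; i < n; i++)
            sum += buf[i];
        buf[n] = sum;
        shmdt(buf);
        _exit(0);
    }
    if (pid > 0) {
        int status = 0;
        if (waitpid(pid, &status, 0) == pid &&
            WIFEXITED(status) && WEXITSTATUS(status) == 0)
            result = buf[n]; /* read the proxy's answer in place */
    }

    shmdt(buf);
    shmctl(id, IPC_RMID, NULL); /* remove the segment once both sides are done */
    return result;
}
```

Calling `proxy_roundtrip(4)` sums the values 0 through 3 in the proxy and returns the total to the caller. The design point is that only control information needs to cross the process boundary, while bulk data stays in one buffer that both sides can address.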
Pages: 9
Related papers
50 in total
  • [21] BlobCR: Virtual disk based checkpoint-restart for HPC applications on IaaS clouds
    Nicolae, Bogdan
    Cappello, Franck
    JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING, 2013, 73 (05) : 698 - 711
  • [22] A Flexible Checkpoint/Restart Model in Distributed Systems
    Bouguerra, Mohamed-Slim
    Gautier, Thierry
    Trystram, Denis
    Vincent, Jean-Marc
    PARALLEL PROCESSING AND APPLIED MATHEMATICS, PT I, 2010, 6067 : 206 - +
  • [23] Checkpoint/Restart in Practice: When 'Simple is Better'
    El-Sayed, Nosayba
    Schroeder, Bianca
    2014 IEEE INTERNATIONAL CONFERENCE ON CLUSTER COMPUTING (CLUSTER), 2014, : 84 - 92
  • [24] Interconnect Agnostic Checkpoint/Restart in Open MPI
    Hursey, Joshua
    Mattox, Timothy I.
    Lumsdaine, Andrew
    HPDC'09: 18TH ACM INTERNATIONAL SYMPOSIUM ON HIGH PERFORMANCE DISTRIBUTED COMPUTING, 2009, : 49 - 58
  • [25] Checkpoint and Restart: An Energy Consumption Characterization in Clusters
    Moran, Marina
    Balladini, Javier
    Rexachs, Dolores
    Luque, Emilio
    COMPUTER SCIENCE - CACIC 2018, 2019, 995 : 19 - 33
  • [26] Checkpoint-Restart for a Network of Virtual Machines
    Garg, Rohan
    Sodha, Komal
    Jin, Zhengping
    Cooperman, Gene
    2013 IEEE INTERNATIONAL CONFERENCE ON CLUSTER COMPUTING (CLUSTER), 2013,
  • [27] Prediction of Energy Consumption by Checkpoint/Restart in HPC
    Moran, M.
    Balladini, J.
    Rexachs, D.
    Luque, E.
    IEEE ACCESS, 2019, 7 : 71791 - 71803
  • [28] Distributed Speculative Parallelization using Checkpoint Restart
    Ghoshal, Devarshi
    Ramkumar, Sreesudhan R.
    Chauhan, Arun
    PROCEEDINGS OF THE INTERNATIONAL CONFERENCE ON COMPUTATIONAL SCIENCE (ICCS), 2011, 4 : 422 - 431
  • [29] Checkpoint/Restart-Enabled Parallel Debugging
    Hursey, Joshua
    January, Chris
    O'Connor, Mark
    Hargrove, Paul H.
    Lecomber, David
    Squyres, Jeffrey M.
    Lumsdaine, Andrew
    RECENT ADVANCES IN THE MESSAGE PASSING INTERFACE, 2010, 6305 : 219 - +
  • [30] Parallel checkpoint/restart without message logging
    Meth, KZ
    Tuel, WG
    2000 INTERNATIONAL WORKSHOPS ON PARALLEL PROCESSING, PROCEEDINGS, 2000, : 253 - 258