Efficient checkpoint/restart of CUDA applications

Cited by: 0
Authors
Nukada, Akira [1]
Suzuki, Taichiro [3]
Matsuoka, Satoshi [2, 3]
Affiliations
[1] Univ Tsukuba, CCS, 1-1-1 Tennodai, Tsukuba, Ibaraki 3058577, Japan
[2] RIKEN R-CCS, 7-1-26 Minatojima Minami Machi, Chuo Ku, Kobe, Hyogo 6500047, Japan
[3] Tokyo Inst Technol, 2-12-1 Ookayama, Meguro Ku, Tokyo 1528550, Japan
Keywords
Checkpoint and restart; NVIDIA CUDA; GPU
DOI
10.1016/j.parco.2023.103018
Chinese Library Classification (CLC) number
TP301 [Theory and methods]
Discipline classification code
081202
Abstract
We present NVCR, which enables transparent checkpoint and restart of CUDA applications. NVCR works as an extension of major system-level checkpoint software such as BLCR and DMTCP. It employs a proxy process, and the application accesses GPU devices via this proxy process, which improves compatibility with the latest CUDA runtime software. To reduce the overhead of inter-process communication, NVCR efficiently uses SYSV IPC shared memory as CUDA pinned memory. Performance evaluations using microbenchmarks and Amber as a real application show that NVCR's overhead is acceptably low.
Pages: 9
Related papers
50 in total
  • [31] Leveraging the checkpoint-restart technique for optimizing CPU efficiency of ATLAS production applications on opportunistic platforms
    Cameron, D.
    Elmsheuser, J.
    Heinrich, L.
    Lavrijsen, W.
    Nilsson, P.
    Tsulaia, V.
    Vogel, M.
    18TH INTERNATIONAL WORKSHOP ON ADVANCED COMPUTING AND ANALYSIS TECHNIQUES IN PHYSICS RESEARCH (ACAT2017), 2018, 1085
  • [32] Scalable Transparent Checkpoint-Restart of Global Address Space Applications on Virtual Machines Over Infiniband
    Villa, Oreste
    Krishnamoorthy, Sriram
    Nieplocha, Jarek
    Brown, David M. Jr
    CF'09: CONFERENCE ON COMPUTING FRONTIERS & WORKSHOPS, 2009, : 197 - 206
  • [33] CUDA Flux: A Lightweight Instruction Profiler for CUDA Applications
    Braun, Lorenz
    Froning, Holger
    PROCEEDINGS OF 2019 IEEE/ACM PERFORMANCE MODELING, BENCHMARKING AND SIMULATION OF HIGH PERFORMANCE COMPUTER SYSTEMS (PMBS 2019), 2019, : 73 - 81
  • [34] Berkeley lab checkpoint/restart (BLCR) for Linux clusters
    Hargrove, Paul H.
    Duell, Jason C.
    SCIDAC 2006: SCIENTIFIC DISCOVERY THROUGH ADVANCED COMPUTING, 2006, 46 : 494 - 499
  • [35] A model for predicting the optimum checkpoint interval for restart dumps
    Daly, J
    COMPUTATIONAL SCIENCE - ICCS 2003, PT IV, PROCEEDINGS, 2003, 2660 : 3 - 12
  • [36] Efficient Execution of Multiple CUDA Applications Using Transparent Suspend, Resume and Migration
    Suzuki, Taichiro
    Nukada, Akira
    Matsuoka, Satoshi
    EURO-PAR 2015: PARALLEL PROCESSING, 2015, 9233 : 687 - 699
  • [37] A survey of checkpoint/restart techniques on distributed memory systems
    Shahzad, Faisal
    Wittmann, Markus
    Kreutzer, Moritz
    Zeiser, Thomas
    Hager, Georg
    Wellein, Gerhard
    PARALLEL PROCESSING LETTERS, 2013, 23 (04)
  • [38] DMTCP: Bringing interactive checkpoint-restart to Python
    Arya, Kapil
    Cooperman, Gene
    Computational Science and Discovery, 2015, 8 (01)
  • [39] Checkpoint and restart for distributed components in XCAT3
    Krishnan, S
    Gannon, D
    FIFTH IEEE/ACM INTERNATIONAL WORKSHOP ON GRID COMPUTING, PROCEEDINGS, 2004, : 281 - 288
  • [40] Virtualization aware job schedulers for checkpoint-restart
    Badrinath, R.
    Krishnakumar, R.
    Rajan, R. K. Palanivel
    2007 INTERNATIONAL CONFERENCE ON PARALLEL AND DISTRIBUTED SYSTEMS, VOLS 1 AND 2, 2007, : 876 - 882