Efficient checkpoint/restart of CUDA applications

Authors
Nukada, Akira [1]
Suzuki, Taichiro [3]
Matsuoka, Satoshi [2,3]
Affiliations
[1] Univ Tsukuba, CCS, 1-1-1 Tennodai, Tsukuba, Ibaraki 3058577, Japan
[2] RIKEN R-CCS, 7-1-26 Minatojima Minami Machi, Chuo Ku, Kobe, Hyogo 6500047, Japan
[3] Tokyo Inst Technol, 2-12-1 Ookayama, Meguro Ku, Tokyo 1528550, Japan
Keywords
Checkpoint and restart; NVIDIA CUDA; GPU
DOI
10.1016/j.parco.2023.103018
Chinese Library Classification number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
We present NVCR, which enables transparent checkpoint and restart of CUDA applications. NVCR works as an extension of major system-level checkpoint software such as BLCR and DMTCP. It employs a proxy process, and the application accesses GPU devices through that proxy process, which improves compatibility with the latest CUDA runtime software. To reduce the overhead of inter-process communication, NVCR efficiently uses SYSV IPC shared memory as CUDA pinned memory. Performance evaluations using micro-benchmarks and Amber as a real application show that NVCR's overhead is acceptably low.
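The shared-memory idea in the abstract can be illustrated with a minimal C sketch. It is not NVCR's implementation: the function name `proxy_scale_via_shm` is hypothetical, a forked child stands in for the proxy process, and a simple doubling loop stands in for the GPU work. The sketch only shows how a SysV IPC segment lets an application and a proxy exchange a buffer without copying it through a pipe or socket. In a real CUDA proxy, one plausible way to make such a segment usable as pinned memory is `cudaHostRegister`, which is an assumption here and is mentioned only in a comment.

```c
#include <stddef.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: sends n floats to a child "proxy" process through a
 * SysV shared memory segment; the proxy doubles them in place.
 * Returns 0 on success, -1 on failure. */
int proxy_scale_via_shm(const float *in, float *out, size_t n)
{
    if (n == 0)
        return 0;                       /* shmget rejects zero-sized segments */

    size_t bytes = n * sizeof(float);
    int id = shmget(IPC_PRIVATE, bytes, IPC_CREAT | 0600);
    if (id < 0)
        return -1;

    float *buf = shmat(id, NULL, 0);
    /* Mark the segment for removal now; it is destroyed after the last
     * detach, so it cannot leak if a process exits early. */
    shmctl(id, IPC_RMID, NULL);
    if (buf == (void *)-1)
        return -1;

    memcpy(buf, in, bytes);             /* application writes its input */

    pid_t pid = fork();                 /* child inherits the attachment */
    if (pid < 0) {
        shmdt(buf);
        return -1;
    }
    if (pid == 0) {
        /* Proxy side. Assumption: a CUDA proxy could pin this region with
         * cudaHostRegister(buf, bytes, cudaHostRegisterDefault) and pass it
         * to cudaMemcpyAsync. Here a CPU loop stands in for the GPU work. */
        for (size_t i = 0; i < n; i++)
            buf[i] *= 2.0f;
        _exit(0);
    }

    int status = 0;
    int ok = waitpid(pid, &status, 0) == pid &&
             WIFEXITED(status) && WEXITSTATUS(status) == 0;
    if (ok)
        memcpy(out, buf, bytes);        /* application reads the result */

    shmdt(buf);
    return ok ? 0 : -1;
}
```

Calling `proxy_scale_via_shm(in, out, n)` from a `main` function fills `out` with the doubled values of `in`. The point of the design is that both processes touch the same physical pages, so the only data movement is the one the application would perform anyway.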
Pages: 9