Cricket: A virtualization layer for distributed execution of CUDA applications with checkpoint/restart support

Cited by: 3
Authors
Eiling, Niklas [1]
Baude, Jonas [1]
Lankes, Stefan [1]
Monti, Antonello [1]
Affiliations
[1] Rhein Westfal TH Aachen, EON Energy Res Ctr, Inst Automat Complex Power Syst, Mathieustr 10, D-52074 Aachen, Germany
Source
Funding
EU Horizon 2020;
Keywords
checkpoint; restart; GPU; remote execution; virtualization;
DOI
10.1002/cpe.6474
Chinese Library Classification (CLC) number
TP31 [Computer Software];
Discipline classification code
081202; 0835;
Abstract
In high-performance computing and cloud computing, the introduction of heterogeneous computing resources, such as GPU accelerators, has led to a dramatic increase in performance and efficiency. While the benefits of virtualization features in these environments are well researched, GPUs do not offer virtualization support that enables fine-grained control, increased flexibility, and fault tolerance. In this article, we present Cricket: a transparent and low-overhead solution to GPU virtualization that, due to its open-source nature, enables future research into other virtualization techniques. Cricket supports remote execution and checkpoint/restart of CUDA applications. Both features enable the distribution of GPU tasks dynamically and flexibly across computing nodes and the multitenant usage of GPU resources, thereby improving flexibility and utilization for high-performance and cloud computing.
Pages: 14
Related papers
10 in total
  • [1] Efficient checkpoint/Restart of CUDA applications
    Nukada, Akira
    Suzuki, Taichiro
    Matsuoka, Satoshi
    PARALLEL COMPUTING, 2023, 116
  • [2] CheCUDA: A Checkpoint/Restart Tool for CUDA Applications
    Takizawa, Hiroyuki
    Sato, Katsuto
    Komatsu, Kazuhiko
    Kobayashi, Hiroaki
    2009 INTERNATIONAL CONFERENCE ON PARALLEL AND DISTRIBUTED COMPUTING, APPLICATIONS AND TECHNOLOGIES (PDCAT 2009), 2009, : 408 - +
  • [3] A Checkpoint/Restart Scheme for CUDA Applications with Complex Memory Hierarchy
    Zhang, Yulu
    Guo, Xinyuan
    Jiang, Hai
    Li, Kuan-Ching
    2013 14TH ACIS INTERNATIONAL CONFERENCE ON SOFTWARE ENGINEERING, ARTIFICIAL INTELLIGENCE, NETWORKING AND PARALLEL/DISTRIBUTED COMPUTING (SNPD 2013), 2013, : 247 - 252
  • [4] CRUM: Checkpoint-Restart Support for CUDA's Unified Memory
    Garg, Rohan
    Mohan, Apoorve
    Sullivan, Michael
    Cooperman, Gene
    2018 IEEE INTERNATIONAL CONFERENCE ON CLUSTER COMPUTING (CLUSTER), 2018, : 302 - 313
  • [5] An Open-Source Virtualization Layer for CUDA Applications
    Eiling, Niklas
    Lankes, Stefan
    Monti, Antonello
    EURO-PAR 2020: PARALLEL PROCESSING WORKSHOPS, 2021, 12480 : 160 - 171
  • [6] Checkpoint Restart Support for Heterogeneous HPC Applications
    Parasyris, Konstantinos
    Keller, Kai
    Bautista-Gomez, Leonardo
    Unsal, Osman
    2020 20TH IEEE/ACM INTERNATIONAL SYMPOSIUM ON CLUSTER, CLOUD AND INTERNET COMPUTING (CCGRID 2020), 2020, : 242 - 251
  • [7] Transparent checkpoint-restart of distributed applications on commodity clusters
    Laadan, Oren
    Phung, Dan
    Nieh, Jason
    2005 IEEE INTERNATIONAL CONFERENCE ON CLUSTER COMPUTING (CLUSTER), 2006, : 52 - +
  • [8] An improved virtualization layer to support distribution of multimedia contents in pervasive social applications
    Fernando Bravo-Torres, Jack
    Lopez-Nores, Martin
    Blanco-Fernandez, Yolanda
    Juan Pazos-Arias, Jose
    Ramos-Cabrer, Manuel
    Gil-Solla, Alberto
    JOURNAL OF NETWORK AND COMPUTER APPLICATIONS, 2015, 51 : 1 - 17
  • [9] Support for data-intensive, variable-granularity grid applications via distributed file system virtualization - A case study of light scattering spectroscopy
    Paladugula, J
    Zhao, M
    Figueiredo, RJ
    PROCEEDINGS OF THE SECOND INTERNATIONAL WORKSHOP ON CHALLENGES OF LARGE APPLICATIONS IN DISTRIBUTED ENVIRONMENTS, 2004, : 12 - 21
  • [10] IA-32 execution layer: a two-phase dynamic translator designed to support IA-32 applications on Itanium®-based systems
    Baraz, L
    Devor, T
    Etzion, O
    Goldenberg, S
    Skaletsky, A
    Wang, Y
    Zemach, Y
    36TH INTERNATIONAL SYMPOSIUM ON MICROARCHITECTURE, PROCEEDINGS, 2003, : 191 - 201