Cricket: A virtualization layer for distributed execution of CUDA applications with checkpoint/restart support

Cited by: 3

Authors:

Eiling, Niklas [1]

Baude, Jonas [1]

Lankes, Stefan [1]

Monti, Antonello [1]

Affiliation:

[1] Rhein Westfal TH Aachen, EON Energy Res Ctr, Inst Automat Complex Power Syst, Mathieustr 10, D-52074 Aachen, Germany

Source:

CONCURRENCY AND COMPUTATION-PRACTICE & EXPERIENCE | 2022 / Vol. 34 / Issue 14

Funding:

European Union Horizon 2020;

Keywords:

checkpoint; restart; GPU; remote execution; virtualization;

DOI:

10.1002/cpe.6474

Chinese Library Classification (CLC) number:

TP31 [Computer software];

Discipline classification code:

081202 ; 0835 ;

Abstract:

In high-performance computing and cloud computing, the introduction of heterogeneous computing resources, such as GPU accelerators, has led to a dramatic increase in performance and efficiency. While the benefits of virtualization features in these environments are well researched, GPUs do not offer virtualization support that enables fine-grained control, increased flexibility, and fault tolerance. In this article, we present Cricket, a transparent and low-overhead solution to GPU virtualization that, due to its open-source nature, enables future research into other virtualization techniques. Cricket supports remote execution and checkpoint/restart of CUDA applications. Both features enable GPU tasks to be distributed dynamically and flexibly across computing nodes and allow multitenant usage of GPU resources, thereby improving flexibility and utilization for high-performance and cloud computing.

Pages: 14
