CUDA-Zero: a framework for porting shared memory GPU applications to multi-GPUs

Cited by: 0

Authors:

DeHao Chen

WenGuang Chen

WeiMin Zheng

Affiliations:

[1] Tsinghua University, Department of Computer Science and Technology

Source:

Science China Information Sciences | 2012 / Vol. 55

Keywords:

CUDA; parallelization; data access pattern; multi-GPU;

DOI:

Not available

Chinese Library Classification number:

Subject classification number:

Abstract:

With the growing prevalence of general-purpose computation on GPUs, shared memory programming models have been proposed to ease the pain of GPU programming. However, as workloads become more intensive, it is desirable to port GPU programs to more scalable distributed memory environments, such as multi-GPU systems. To achieve this, programs need to be rewritten with mixed programming models (e.g., CUDA and message passing). Programmers need to work carefully not only on workload distribution but also on scheduling mechanisms to ensure efficient execution. In this paper, we study the possibility of automating the parallelization of programs for multi-GPU systems. Starting from a GPU program written in the shared memory model, our framework analyzes the access patterns of arrays in kernel functions to derive data partition schemes. To acquire the access patterns, we propose a three-tier approach: static analysis, profile-based analysis, and user annotation. Experiments show that most access patterns can be derived correctly by the first two tiers, which means that zero effort is needed to port an existing application to a distributed memory environment. We use our framework to parallelize several applications and show that, for certain kinds of applications, CUDA-Zero can achieve efficient parallelization in a multi-GPU environment.
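
The idea of deriving a data partition from profiled array accesses can be illustrated with a small sketch. This is not the CUDA-Zero implementation, which the abstract does not detail; it is a minimal Python model under stated assumptions: a profiling run is assumed to yield `(block_id, array_index)` pairs, thread blocks are assumed to be split into contiguous chunks across GPUs, and the function names `block_access_ranges`, `plan_gpu_ranges`, and `needs_exchange` are hypothetical.

```python
def block_access_ranges(trace):
    """Collapse a profiled trace of (block_id, array_index) pairs into
    the inclusive (lo, hi) index range touched by each thread block."""
    ranges = {}
    for block_id, idx in trace:
        lo, hi = ranges.get(block_id, (idx, idx))
        ranges[block_id] = (min(lo, idx), max(hi, idx))
    return ranges


def plan_gpu_ranges(ranges, num_gpus):
    """Assign contiguous chunks of thread blocks to GPUs (an assumption of
    this sketch) and return, per GPU, the array range its blocks touch.
    A GPU that receives no blocks gets None."""
    block_ids = sorted(ranges)
    chunk = -(-len(block_ids) // num_gpus)  # ceiling division
    plan = []
    for gpu in range(num_gpus):
        mine = block_ids[gpu * chunk:(gpu + 1) * chunk]
        if not mine:
            plan.append(None)
            continue
        lo = min(ranges[b][0] for b in mine)
        hi = max(ranges[b][1] for b in mine)
        plan.append((lo, hi))
    return plan


def needs_exchange(plan):
    """Return True if any two GPU ranges overlap, meaning the array cannot
    be split into disjoint pieces and boundary data would have to be
    replicated or exchanged between GPUs."""
    spans = sorted(p for p in plan if p is not None)
    return any(a[1] >= b[0] for a, b in zip(spans, spans[1:]))


# Hypothetical traces: 16 threads in blocks of 4.
# Element-wise kernel: thread i touches only a[i].
elementwise = [(i // 4, i) for i in range(16)]
# Three-point stencil: thread i touches a[i-1], a[i], a[i+1].
stencil = [(i // 4, j) for i in range(16)
           for j in (i - 1, i, i + 1) if 0 <= j < 16]

elementwise_plan = plan_gpu_ranges(block_access_ranges(elementwise), 2)
stencil_plan = plan_gpu_ranges(block_access_ranges(stencil), 2)
```

With two GPUs, the element-wise trace yields the disjoint ranges `(0, 7)` and `(8, 15)`, so the array can simply be split in half. The stencil trace yields `(0, 8)` and `(7, 15)`, which overlap at the boundary, so `needs_exchange` reports that some data must be shared between the two devices.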

Pages: 663 - 676

Number of pages: 13
