Automatic Parallelization of Kernels in Shared-Memory Multi-GPU Nodes

Cited by: 11

Authors:

Cabezas, Javier [1]

Vilanova, Lluis [1]

Gelado, Isaac [2]

Jablin, Thomas B. [3]

Navarro, Nacho [1,4]

Hwu, Wen-mei W. [3]

Affiliations:

[1] Barcelona Supercomp Ctr, Barcelona, Spain

[2] NVIDIA Corp, Santa Clara, CA USA

[3] Univ Illinois, Urbana, IL USA

[4] Univ Politecn Cataluna, Barcelona, Spain

Source:

PROCEEDINGS OF THE 29TH ACM INTERNATIONAL CONFERENCE ON SUPERCOMPUTING (ICS'15) | 2015

Keywords:

Multi-GPU programming; NUMA;

DOI:

10.1145/2751205.2751218

Chinese Library Classification:

TP3 [Computing technology, computer technology];

Subject classification code:

0812;

Abstract:

In this paper we present AMGE, a programming framework and runtime system that transparently decomposes GPU kernels and executes them on multiple GPUs in parallel. AMGE exploits the remote memory access capability in modern GPUs to ensure that data can be accessed regardless of its physical location, allowing our runtime to safely decompose and distribute arrays across GPU memories. It optionally performs a compiler analysis that detects array access patterns in GPU kernels. Using this information, the runtime can perform more efficient computation and data distribution configurations than previous works. The GPU execution model allows AMGE to hide the cost of remote accesses if they are kept below 5%. We demonstrate that a thread block scheduling policy that distributes remote accesses through the whole kernel execution further reduces their overhead. Results show 1.98x and 3.89x execution speedups for 2 and 4 GPUs for a wide range of dense computations compared to the original versions on a single GPU.


Pages: 3-13

Page count: 11
