Automatic Parallelization of Kernels in Shared-Memory Multi-GPU Nodes

Cited by: 11

Authors:

Cabezas, Javier ^{[1]}

Vilanova, Lluis ^{[1]}

Gelado, Isaac ^{[2]}

Jablin, Thomas B. ^{[3]}

Navarro, Nacho ^{[1,4]}

Hwu, Wen-mei W. ^{[3]}

Affiliations:

[1] Barcelona Supercomp Ctr, Barcelona, Spain

[2] NVIDIA Corp, Santa Clara, CA USA

[3] Univ Illinois, Urbana, IL USA

[4] Univ Politecn Cataluna, Barcelona, Spain

Source:

PROCEEDINGS OF THE 29TH ACM INTERNATIONAL CONFERENCE ON SUPERCOMPUTING (ICS'15) | 2015

Keywords:

Multi-GPU programming; NUMA

DOI:

10.1145/2751205.2751218

Chinese Library Classification (CLC) number:

TP3 [Computing Technology, Computer Technology]

Discipline classification code:

0812

Abstract:

In this paper we present AMGE, a programming framework and runtime system that transparently decomposes GPU kernels and executes them on multiple GPUs in parallel. AMGE exploits the remote memory access capability in modern GPUs to ensure that data can be accessed regardless of its physical location, allowing our runtime to safely decompose and distribute arrays across GPU memories. It optionally performs a compiler analysis that detects array access patterns in GPU kernels. Using this information, the runtime can perform more efficient computation and data distribution configurations than previous works. The GPU execution model allows AMGE to hide the cost of remote accesses if they are kept below 5%. We demonstrate that a thread block scheduling policy that distributes remote accesses through the whole kernel execution further reduces their overhead. Results show 1.98x and 3.89x execution speedups for 2 and 4 GPUs for a wide range of dense computations compared to the original versions on a single GPU.


Pages: 3 / 13

Page count: 11
