Automatic Parallelization of Kernels in Shared-Memory Multi-GPU Nodes

Cited by: 11
Authors
Cabezas, Javier [1 ]
Vilanova, Lluis [1 ]
Gelado, Isaac [2 ]
Jablin, Thomas B. [3 ]
Navarro, Nacho [1 ,4 ]
Hwu, Wen-mei W. [3 ]
Affiliations
[1] Barcelona Supercomp Ctr, Barcelona, Spain
[2] NVIDIA Corp, Santa Clara, CA USA
[3] Univ Illinois, Urbana, IL USA
[4] Univ Politecn Cataluna, Barcelona, Spain
Keywords
Multi-GPU programming; NUMA
DOI
10.1145/2751205.2751218
Chinese Library Classification
TP3 [Computing technology, computer technology]
Subject Classification Code
0812
Abstract
In this paper we present AMGE, a programming framework and runtime system that transparently decomposes GPU kernels and executes them on multiple GPUs in parallel. AMGE exploits the remote memory access capability of modern GPUs to ensure that data can be accessed regardless of its physical location, allowing the runtime to safely decompose and distribute arrays across GPU memories. It optionally performs a compiler analysis that detects array access patterns in GPU kernels; using this information, the runtime can choose more efficient computation and data distribution configurations than previous approaches. The GPU execution model allows AMGE to hide the cost of remote accesses as long as they are kept below 5%. We demonstrate that a thread block scheduling policy that spreads remote accesses across the whole kernel execution further reduces their overhead. Results show 1.98x and 3.89x execution speedups on 2 and 4 GPUs, respectively, for a wide range of dense computations compared to the original single-GPU versions.
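The abstract's core mechanism, decomposing a kernel's thread blocks across GPUs while relying on peer (remote) memory access for data that physically resides on another device, can be illustrated by hand in plain CUDA. The sketch below is an illustration under stated assumptions, not AMGE's API: the vec_add kernel, array sizes, and chunked distribution are all hypothetical, and error checking is omitted.

// Hand-written multi-GPU decomposition of a 1D kernel (illustrative only;
// AMGE automates this). Peer access lets a kernel running on one GPU
// dereference pointers into another GPU's memory, the remote-access
// capability the paper exploits.
#include <vector>
#include <cuda_runtime.h>

__global__ void vec_add(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    int ngpus = 0;
    cudaGetDeviceCount(&ngpus);
    if (ngpus < 1) return 1;

    // Enable peer (remote) access between every pair of GPUs.
    for (int d = 0; d < ngpus; ++d) {
        cudaSetDevice(d);
        for (int p = 0; p < ngpus; ++p)
            if (p != d) cudaDeviceEnablePeerAccess(p, 0);
    }

    const int n = 1 << 20;
    const int chunk = (n + ngpus - 1) / ngpus;
    std::vector<float*> a(ngpus), b(ngpus), c(ngpus);

    // Distribute each array across the GPU memories, one chunk per device.
    for (int d = 0; d < ngpus; ++d) {
        cudaSetDevice(d);
        cudaMalloc(&a[d], chunk * sizeof(float));
        cudaMalloc(&b[d], chunk * sizeof(float));
        cudaMalloc(&c[d], chunk * sizeof(float));
    }

    // Launch one partition of the original grid per GPU. This regular
    // pattern touches only local chunks; kernels with boundary or irregular
    // accesses would transparently read remote chunks via peer access.
    for (int d = 0; d < ngpus; ++d) {
        cudaSetDevice(d);
        int len = (d == ngpus - 1) ? n - d * chunk : chunk;
        vec_add<<<(len + 255) / 256, 256>>>(a[d], b[d], c[d], len);
    }
    for (int d = 0; d < ngpus; ++d) {
        cudaSetDevice(d);
        cudaDeviceSynchronize();
    }
    return 0;
}

In AMGE, this decomposition, the array distribution, and the per-GPU launches happen transparently in the runtime, guided by the optional compiler analysis of array access patterns.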
Pages: 3-13
Page count: 11
Related Papers (50 in total)
  • [1] Workload-Aware Automatic Parallelization for Multi-GPU DNN Training
    Shin, Sungho
    Jo, Youngmin
    Choi, Jungwook
    Venkataramani, Swagath
    Srinivasan, Vijayalakshmi
    Sung, Wonyong
    2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019, : 1453 - 1457
  • [2] Efficient Multi-GPU Shared Memory via Automatic Optimization of Fine-Grained Transfers
    Muthukrishnan, Harini
    Nellans, David
    Lustig, Daniel
    Fessler, Jeffrey A.
    Wenisch, Thomas F.
    2021 ACM/IEEE 48TH ANNUAL INTERNATIONAL SYMPOSIUM ON COMPUTER ARCHITECTURE (ISCA 2021), 2021, : 139 - 152
  • [3] Parallelization of benchmarks for scalable shared-memory multiprocessors
    Paek, Y
    Navarro, A
    Zapata, E
    Padua, D
    1998 INTERNATIONAL CONFERENCE ON PARALLEL ARCHITECTURES AND COMPILATION TECHNIQUES, PROCEEDINGS, 1998, : 401 - 408
  • [4] Shared-memory parallelization of a local correlation multi-reference CI program
    Dieterich, Johannes M.
    Krisiloff, David B.
    Gaenko, Alexander
    Libisch, Florian
    Windus, Theresa L.
    Gordon, Mark S.
    Carter, Emily A.
    COMPUTER PHYSICS COMMUNICATIONS, 2014, 185 (12) : 3175 - 3188
  • [5] The Optimization of Model Parallelization Strategies for Multi-GPU Training
    Zhang, Zechao
    Chen, Jianfeng
    Hu, Bing
    2021 IEEE GLOBAL COMMUNICATIONS CONFERENCE (GLOBECOM), 2021
  • [6] POSTER: Shared-Memory Parallelization of MTTKRP for Dense Tensors
    Hayashi, Koby
    Ballard, Grey
    Jiang, Yujie
    Tobia, Michael J.
    ACM SIGPLAN NOTICES, 2018, 53 (01) : 393 - 394
  • [7] Topology-Aware GPU Selection on Multi-GPU Nodes
    Faraji, Iman
    Mirsadeghi, Seyed H.
    Afsahi, Ahmad
    2016 IEEE 30TH INTERNATIONAL PARALLEL AND DISTRIBUTED PROCESSING SYMPOSIUM WORKSHOPS (IPDPSW), 2016, : 712 - 720
  • [8] Global Shared Memory Design for Multi-GPU Graphics Cards on Personal Supercomputer
    Guo, Sen
    Chen, Sanfeng
    Liang, YongSheng
    INFORMATION TECHNOLOGY APPLICATIONS IN INDUSTRY, PTS 1-4, 2013, 263-266 : 1236 - 1241
  • [9] Multi-GPU Parallelization of the NAS Multi-Zone Parallel Benchmarks
    Gonzalez, Marc
    Morancho, Enric
    IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2021, 32 (01) : 229 - 241
  • [10] Multi-GPU and Multi-CPU Parallelization for Interactive Physics Simulations
    Hermann, Everton
    Raffin, Bruno
    Faure, Francois
    Gautier, Thierry
    Allard, Jeremie
    EURO-PAR 2010 - PARALLEL PROCESSING, PART II, 2010, 6272 : 235 - 246