Automatic Parallelization of Kernels in Shared-Memory Multi-GPU Nodes

Cited by: 11
Authors
Cabezas, Javier [1 ]
Vilanova, Lluis [1 ]
Gelado, Isaac [2 ]
Jablin, Thomas B. [3 ]
Navarro, Nacho [1 ,4 ]
Hwu, Wen-mei W. [3 ]
Affiliations
[1] Barcelona Supercomp Ctr, Barcelona, Spain
[2] NVIDIA Corp, Santa Clara, CA USA
[3] Univ Illinois, Urbana, IL USA
[4] Univ Politecn Cataluna, Barcelona, Spain
Keywords
Multi-GPU programming; NUMA
DOI
10.1145/2751205.2751218
Chinese Library Classification
TP3 [Computing technology, computer technology]
Subject Classification Code
0812
Abstract
In this paper we present AMGE, a programming framework and runtime system that transparently decomposes GPU kernels and executes them on multiple GPUs in parallel. AMGE exploits the remote memory access capability of modern GPUs to ensure that data can be accessed regardless of its physical location, allowing the runtime to safely decompose and distribute arrays across GPU memories. It optionally performs a compiler analysis that detects array access patterns in GPU kernels; using this information, the runtime can choose more efficient computation and data distribution configurations than previous approaches. The GPU execution model allows AMGE to hide the cost of remote accesses as long as they are kept below 5%. We demonstrate that a thread block scheduling policy that spreads remote accesses across the whole kernel execution further reduces their overhead. Results show 1.98x and 3.89x execution speedups on 2 and 4 GPUs, respectively, for a wide range of dense computations compared to the original single-GPU versions.
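The abstract's core mechanism, decomposing a kernel's thread blocks across GPUs while relying on peer (remote) memory access for data that physically resides on another device, can be illustrated by hand in plain CUDA. The sketch below is an illustration under stated assumptions, not AMGE's API: the vec_add kernel, array sizes, and chunked distribution are all hypothetical, and error checking is omitted.

// Hand-written multi-GPU decomposition of a 1D kernel (illustrative only;
// AMGE automates this). Peer access lets a kernel running on one GPU
// dereference pointers into another GPU's memory, the remote-access
// capability the paper exploits.
#include <vector>
#include <cuda_runtime.h>

__global__ void vec_add(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    int ngpus = 0;
    cudaGetDeviceCount(&ngpus);
    if (ngpus < 1) return 1;

    // Enable peer (remote) access between every pair of GPUs.
    for (int d = 0; d < ngpus; ++d) {
        cudaSetDevice(d);
        for (int p = 0; p < ngpus; ++p)
            if (p != d) cudaDeviceEnablePeerAccess(p, 0);
    }

    const int n = 1 << 20;
    const int chunk = (n + ngpus - 1) / ngpus;
    std::vector<float*> a(ngpus), b(ngpus), c(ngpus);

    // Distribute each array across the GPU memories, one chunk per device.
    for (int d = 0; d < ngpus; ++d) {
        cudaSetDevice(d);
        cudaMalloc(&a[d], chunk * sizeof(float));
        cudaMalloc(&b[d], chunk * sizeof(float));
        cudaMalloc(&c[d], chunk * sizeof(float));
    }

    // Launch one partition of the original grid per GPU. This regular
    // pattern touches only local chunks; kernels with boundary or irregular
    // accesses would transparently read remote chunks via peer access.
    for (int d = 0; d < ngpus; ++d) {
        cudaSetDevice(d);
        int len = (d == ngpus - 1) ? n - d * chunk : chunk;
        vec_add<<<(len + 255) / 256, 256>>>(a[d], b[d], c[d], len);
    }
    for (int d = 0; d < ngpus; ++d) {
        cudaSetDevice(d);
        cudaDeviceSynchronize();
    }
    return 0;
}

In AMGE, this decomposition, the array distribution, and the per-GPU launches happen transparently in the runtime, guided by the optional compiler analysis of array access patterns.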
Pages: 3-13
Page count: 11
Related Papers (50 in total)
  • [1] Workload-Aware Automatic Parallelization for Multi-GPU DNN Training
    Shin, Sungho
    Jo, Youngmin
    Choi, Jungwook
    Venkataramani, Swagath
    Srinivasan, Vijayalakshmi
    Sung, Wonyong
    2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019, : 1453 - 1457
  • [2] Efficient Multi-GPU Shared Memory via Automatic Optimization of Fine-Grained Transfers
    Muthukrishnan, Harini
    Nellans, David
    Lustig, Daniel
    Fessler, Jeffrey A.
    Wenisch, Thomas F.
    2021 ACM/IEEE 48TH ANNUAL INTERNATIONAL SYMPOSIUM ON COMPUTER ARCHITECTURE (ISCA 2021), 2021, : 139 - 152
  • [3] Parallelization of benchmarks for scalable shared-memory multiprocessors
    Paek, Y
    Navarro, A
    Zapata, E
    Padua, D
    1998 INTERNATIONAL CONFERENCE ON PARALLEL ARCHITECTURES AND COMPILATION TECHNIQUES, PROCEEDINGS, 1998, : 401 - 408
  • [4] Shared-memory parallelization of a local correlation multi-reference CI program
    Dieterich, Johannes M.
    Krisiloff, David B.
    Gaenko, Alexander
    Libisch, Florian
    Windus, Theresa L.
    Gordon, Mark S.
    Carter, Emily A.
    COMPUTER PHYSICS COMMUNICATIONS, 2014, 185 (12) : 3175 - 3188
  • [5] The Optimization of Model Parallelization Strategies for Multi-GPU Training
    Zhang, Zechao
    Chen, Jianfeng
    Hu, Bing
    2021 IEEE GLOBAL COMMUNICATIONS CONFERENCE (GLOBECOM), 2021
  • [6] POSTER: Shared-Memory Parallelization of MTTKRP for Dense Tensors
    Hayashi, Koby
    Ballard, Grey
    Jiang, Yujie
    Tobia, Michael J.
    ACM SIGPLAN NOTICES, 2018, 53 (01) : 393 - 394
  • [7] Topology-Aware GPU Selection on Multi-GPU Nodes
    Faraji, Iman
    Mirsadeghi, Seyed H.
    Afsahi, Ahmad
    2016 IEEE 30TH INTERNATIONAL PARALLEL AND DISTRIBUTED PROCESSING SYMPOSIUM WORKSHOPS (IPDPSW), 2016, : 712 - 720
  • [8] Global Shared Memory Design for Multi-GPU Graphics Cards on Personal Supercomputer
    Guo, Sen
    Chen, Sanfeng
    Liang, YongSheng
    INFORMATION TECHNOLOGY APPLICATIONS IN INDUSTRY, PTS 1-4, 2013, 263-266 : 1236 - 1241
  • [9] Multi-GPU Parallelization of the NAS Multi-Zone Parallel Benchmarks
    Gonzalez, Marc
    Morancho, Enric
    IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2021, 32 (01) : 229 - 241
  • [10] Multi-GPU and Multi-CPU Parallelization for Interactive Physics Simulations
    Hermann, Everton
    Raffin, Bruno
    Faure, Francois
    Gautier, Thierry
    Allard, Jeremie
    EURO-PAR 2010 - PARALLEL PROCESSING, PART II, 2010, 6272 : 235 - 246