Automatic Parallelization of Kernels in Shared-Memory Multi-GPU Nodes

Citations: 11
Authors
Cabezas, Javier [1]
Vilanova, Lluis [1]
Gelado, Isaac [2]
Jablin, Thomas B. [3]
Navarro, Nacho [1, 4]
Hwu, Wen-mei W. [3]
Affiliations
[1] Barcelona Supercomp Ctr, Barcelona, Spain
[2] NVIDIA Corp, Santa Clara, CA USA
[3] Univ Illinois, Urbana, IL USA
[4] Univ Politecn Cataluna, Barcelona, Spain
Keywords
Multi-GPU programming; NUMA;
DOI
10.1145/2751205.2751218
Chinese Library Classification number
TP3 [Computing technology, computer technology]
Subject classification code
0812
Abstract
In this paper we present AMGE, a programming framework and runtime system that transparently decomposes GPU kernels and executes them on multiple GPUs in parallel. AMGE exploits the remote memory access capability in modern GPUs to ensure that data can be accessed regardless of its physical location, allowing our runtime to safely decompose and distribute arrays across GPU memories. It optionally performs a compiler analysis that detects array access patterns in GPU kernels. Using this information, the runtime can perform more efficient computation and data distribution configurations than previous works. The GPU execution model allows AMGE to hide the cost of remote accesses if they are kept below 5%. We demonstrate that a thread block scheduling policy that distributes remote accesses through the whole kernel execution further reduces their overhead. Results show 1.98x and 3.89x execution speedups for 2 and 4 GPUs for a wide range of dense computations compared to the original versions on a single GPU.
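The abstract's claim that remote accesses stay cheap when they are a small share of all accesses can be illustrated with a toy model. The sketch below is not AMGE's algorithm or API: the function name `count_remote_reads`, the contiguous array split, the rule that a thread block runs on the GPU owning its first element, and the 1D stencil access pattern are all assumptions chosen for illustration.

```python
def count_remote_reads(n, num_gpus, block_size, halo):
    """Count reads that land in another GPU's memory for a 1D stencil.

    Assumptions (hypothetical, not taken from the paper):
    - an array of n elements is split into contiguous chunks, one per GPU;
    - each thread block covers block_size consecutive elements and runs
      on the GPU that owns the block's first element;
    - thread i reads elements i - halo .. i + halo (clipped to the array).
    Returns (remote_reads, total_reads).
    """
    chunk = -(-n // num_gpus)  # ceiling division: elements per GPU
    total = 0
    remote = 0
    for block_start in range(0, n, block_size):
        gpu = block_start // chunk  # GPU that executes this thread block
        for i in range(block_start, min(block_start + block_size, n)):
            for j in range(max(0, i - halo), min(n, i + halo + 1)):
                total += 1
                if j // chunk != gpu:  # element j lives on a different GPU
                    remote += 1
    return remote, total


if __name__ == "__main__":
    remote, total = count_remote_reads(n=1024, num_gpus=4, block_size=64, halo=1)
    print(remote, total, remote / total)
```

Under these assumptions, 1024 elements on 4 GPUs with a one-element halo give 6 remote reads out of 3070, far below the 5% threshold the abstract mentions; only the threads next to a chunk boundary reach into a neighbouring GPU's memory.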
Pages: 3-13
Page count: 11
Related papers
50 in total
  • [21] Multi-GPU System Design with Memory Networks
    Kim, Gwangsun
    Lee, Minseok
    Jeong, Jiyun
    Kim, John
    2014 47TH ANNUAL IEEE/ACM INTERNATIONAL SYMPOSIUM ON MICROARCHITECTURE (MICRO), 2014, : 484 - 495
  • [22] Extending Shared-Memory Computations to Multiple Distributed Nodes
    Ahmed, Waseem
    INTERNATIONAL JOURNAL OF ADVANCED COMPUTER SCIENCE AND APPLICATIONS, 2020, 11 (08) : 675 - 685
  • [23] Distributed texture memory in a Multi-GPU environment
    Moerschell, Adam
    Owens, John D.
    COMPUTER GRAPHICS FORUM, 2008, 27 (01) : 130 - 151
  • [24] Exploiting heterogeneity of communication channels for efficient GPU selection on multi-GPU nodes
    Faraji, Iman
    Mirsadeghi, Seyed H.
    Afsahi, Ahmad
    PARALLEL COMPUTING, 2017, 68 : 3 - 16
  • [25] Parallelization of the ILU(0) preconditioner for CFD problems on shared-memory computers
    Dutto, LC
    Habashi, WG
    INTERNATIONAL JOURNAL FOR NUMERICAL METHODS IN FLUIDS, 1999, 30 (08) : 995 - 1008
  • [26] Examining Failures and Repairs on Supercomputers with Multi-GPU Compute Nodes
    Taherin, Amir
    Patel, Tirthak
    Georgakoudis, Giorgis
    Laguna, Ignacio
    Tiwari, Devesh
    51ST ANNUAL IEEE/IFIP INTERNATIONAL CONFERENCE ON DEPENDABLE SYSTEMS AND NETWORKS (DSN 2021), 2021, : 305 - 313
  • [27] MACC: An OpenACC Transpiler for Automatic Multi-GPU Use
    Matsumura, Kazuaki
    Sato, Mitsuhisa
    Boku, Taisuke
    Podobas, Artur
    Matsuoka, Satoshi
    SUPERCOMPUTING FRONTIERS, SCFA 2018, 2018, 10776 : 109 - 127
  • [28] Statistical Modeling of Power/Energy of Scientific Kernels on a Multi-GPU system
    Ghosh, Sayan
    Chandrasekaran, Sunita
    Chapman, Barbara
    2013 INTERNATIONAL GREEN COMPUTING CONFERENCE (IGCC), 2013,
  • [29] Hybrid MPI and CUDA Parallelization for CFD Applications on Multi-GPU HPC Clusters
    Lai, Jianqi
    Yu, Hang
    Tian, Zhengyu
    Li, Hua
    SCIENTIFIC PROGRAMMING, 2020, 2020
  • [30] JACC: An OpenACC Runtime Framework with Kernel-Level and Multi-GPU Parallelization
    Matsumura, Kazuaki
    de Gonzalo, Simon Garcia
    Pena, Antonio J.
    2021 IEEE 28TH INTERNATIONAL CONFERENCE ON HIGH PERFORMANCE COMPUTING, DATA, AND ANALYTICS (HIPC 2021), 2021, : 182 - 191