Automatic Parallelization of Kernels in Shared-Memory Multi-GPU Nodes

Cited by: 11
Authors
Cabezas, Javier [1]
Vilanova, Lluis [1]
Gelado, Isaac [2]
Jablin, Thomas B. [3]
Navarro, Nacho [1,4]
Hwu, Wen-mei W. [3]
Affiliations
[1] Barcelona Supercomp Ctr, Barcelona, Spain
[2] NVIDIA Corp, Santa Clara, CA USA
[3] Univ Illinois, Urbana, IL USA
[4] Univ Politecn Cataluna, Barcelona, Spain
Keywords
Multi-GPU programming; NUMA
DOI
10.1145/2751205.2751218
Chinese Library Classification
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
In this paper we present AMGE, a programming framework and runtime system that transparently decomposes GPU kernels and executes them on multiple GPUs in parallel. AMGE exploits the remote memory access capability in modern GPUs to ensure that data can be accessed regardless of its physical location, allowing our runtime to safely decompose and distribute arrays across GPU memories. It optionally performs a compiler analysis that detects array access patterns in GPU kernels. Using this information, the runtime can perform more efficient computation and data distribution configurations than previous works. The GPU execution model allows AMGE to hide the cost of remote accesses if they are kept below 5%. We demonstrate that a thread block scheduling policy that distributes remote accesses through the whole kernel execution further reduces their overhead. Results show 1.98x and 3.89x execution speedups for 2 and 4 GPUs for a wide range of dense computations compared to the original versions on a single GPU.
Pages: 3-13
Number of pages: 11
Related papers
50 in total
  • [41] Parallelization Efficiency of Multi-GPU In-Core LU-Decomposition of Dense Matrices
    Teneh, Nimrod
    Mrdakovic, Branko Lj
    Kostic, Milan M.
    Olcan, Dragan I.
    Kolundzija, Branko M.
    2019 IEEE INTERNATIONAL SYMPOSIUM ON ANTENNAS AND PROPAGATION AND USNC-URSI RADIO SCIENCE MEETING, 2019, : 1253 - 1254
  • [42] Memory Access Patterns: The Missing Piece of the Multi-GPU Puzzle
    Ben-Nun, Tal
    Levy, Ely
    Barak, Amnon
    Rubin, Eri
    PROCEEDINGS OF SC15: THE INTERNATIONAL CONFERENCE FOR HIGH PERFORMANCE COMPUTING, NETWORKING, STORAGE AND ANALYSIS, 2015,
  • [43] Efficient Multi-GPU Memory Management for Deep Learning Acceleration
    Kim, Youngrang
    Lee, Jaehwan
    Kim, Jik-Soo
    Jei, Hyunseung
    Roh, Hongchan
    2018 IEEE 3RD INTERNATIONAL WORKSHOPS ON FOUNDATIONS AND APPLICATIONS OF SELF* SYSTEMS (FAS*W), 2018, : 37 - 43
  • [44] A Multi-GPU Framework for In-Memory Text Data Analytics
    Chong, Poh Kit
    Karuppiah, Ettikan K.
    Yong, Keh Kok
    2013 IEEE 27TH INTERNATIONAL CONFERENCE ON ADVANCED INFORMATION NETWORKING AND APPLICATIONS WORKSHOPS (WAINA), 2013, : 1411 - 1416
  • [45] On automatic parallelization of irregular reductions on scalable shared memory systems
    Gutiérrez, E
    Plata, O
    Zapata, EL
    EURO-PAR'99: PARALLEL PROCESSING, 1999, 1685 : 422 - 429
  • [46] PARALLELIZATION AND PERFORMANCE ANALYSIS OF THE COOLEY-TUKEY FFT ALGORITHM FOR SHARED-MEMORY ARCHITECTURES
    NORTON, A
    SILBERGER, AJ
    IEEE TRANSACTIONS ON COMPUTERS, 1987, 36 (05) : 581 - 591
  • [47] Efficient Parallelization of Path Planning Workload on Single-chip Shared-memory Multicores
    Ahmad, Masab
    Lakshminarasimhan, Kartik
    Khan, Omer
    2015 IEEE HIGH PERFORMANCE EXTREME COMPUTING CONFERENCE (HPEC), 2015,
  • [48] Afivo: A framework for quadtree/octree AMR with shared-memory parallelization and geometric multigrid methods
    Teunissen, Jannis
    Ebert, Ute
    COMPUTER PHYSICS COMMUNICATIONS, 2018, 233 : 156 - 166
  • [49] Scalable Shared-Memory Parallelization of the Block Recursive Inversion Algorithm Poster extended abstract
    Silva, Maria C. M.
    Cosme, Iria C. S.
    Sardina, Idalmis M.
    Xavier-de-Souza, Samuel
    2018 IEEE INTERNATIONAL CONFERENCE ON CLUSTER COMPUTING (CLUSTER), 2018, : 154 - 155
  • [50] Shared-memory parallelization technology of unstructured CFD solver for multi-core CPU/many-core GPU architecture
    Zhang J.
    Li R.
    Deng L.
    Dai Z.
    Liu J.
    Xu C.
    Hangkong Xuebao/Acta Aeronautica et Astronautica Sinica, 2024, 45 (07):