A Versatile Software Systolic Execution Model for GPU Memory-Bound Kernels

Cited by: 11
Authors
Chen, Peng [1, 2]
Wahib, Mohamed [2]
Takizawa, Shinichiro [2]
Takano, Ryousei [3]
Matsuoka, Satoshi [1, 4]
Affiliations
[1] Tokyo Inst Technol, Tokyo, Japan
[2] Natl Inst Adv Ind Sci & Technol, AIST Tokyo Tech Real World Big Data Computat Open, Tsukuba, Ibaraki, Japan
[3] Natl Inst Adv Ind Sci & Technol, Tsukuba, Ibaraki, Japan
[4] RIKEN Ctr Computat Sci, Kobe, Hyogo, Japan
Keywords
Systolic Array; GPU; CUDA; Convolution; Stencil; OPTIMIZATION; ARRAYS; DESIGN;
DOI
10.1145/3295500.3356162
Chinese Library Classification number
TP301 [Theory, Methods];
Discipline classification code
081202;
Abstract
This paper proposes a versatile high-performance execution model, inspired by systolic arrays, for memory-bound regular kernels running on CUDA-enabled GPUs. We formulate a systolic model that shifts partial sums by CUDA warp primitives for the computation. We also employ register files as a cache resource in order to operate the entire model efficiently. We demonstrate the effectiveness and versatility of the proposed model for a wide variety of stencil kernels that appear commonly in HPC, and also convolution kernels (increasingly important in deep learning workloads). Our algorithm outperforms the top reported state-of-the-art stencil implementations, including implementations with sophisticated temporal and spatial blocking techniques, on the two latest Nvidia architectures: Tesla V100 and P100. For 2D convolution of general filter sizes and shapes, our algorithm is on average 2.5x faster than Nvidia's NPP on V100 and P100 GPUs.
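The idea of shifting partial sums between warp lanes can be illustrated with a small CPU emulation. The sketch below is not the authors' implementation: it is a minimal Python model, under the assumption of a single 32-lane warp and a 1D "valid" convolution, in which each lane keeps one input value in a "register" and the partial sum moves one lane up per filter tap, mimicking the semantics of CUDA's `__shfl_up_sync`. The names `shfl_up`, `systolic_conv1d`, and `reference_conv1d` are hypothetical helpers introduced only for this illustration.

```python
WARP_SIZE = 32  # assumed warp width, as on current Nvidia GPUs


def shfl_up(values, delta):
    """Emulate __shfl_up_sync over one warp.

    Lane i reads the value held by lane i - delta; lanes with
    i < delta have no source lane and keep their own value.
    """
    return [values[i - delta] if i >= delta else values[i]
            for i in range(len(values))]


def systolic_conv1d(x, w):
    """1D valid convolution out[i] = sum_t w[t] * x[i + t].

    Each lane holds exactly one input element x[lane]. The partial
    sum is shifted one lane up before each new tap is accumulated,
    so the inputs stay in place and only the partial sums move.
    """
    assert len(x) == WARP_SIZE
    taps = len(w)
    # Tap 0: every lane starts a partial sum from its own input.
    partial = [w[0] * xi for xi in x]
    for t in range(1, taps):
        # Shift partial sums to the next lane, then add this lane's term.
        partial = shfl_up(partial, 1)
        partial = [p + w[t] * xi for p, xi in zip(partial, x)]
    # Lanes below taps - 1 never received a complete sum; drop them.
    return partial[taps - 1:]


def reference_conv1d(x, w):
    """Direct evaluation of the same convolution, for comparison."""
    n_out = len(x) - len(w) + 1
    return [sum(w[t] * x[i + t] for t in range(len(w)))
            for i in range(n_out)]


x = [(i * i) % 7 for i in range(WARP_SIZE)]  # arbitrary integer inputs
w = [1, -2, 3]                               # arbitrary 3-tap filter
out = systolic_conv1d(x, w)
```

After the last tap, lane `j` holds the output for position `j - (len(w) - 1)`, which is why the first `len(w) - 1` lanes are discarded. On a real GPU the list comprehension over lanes would be replaced by the threads of a warp executing in lockstep, and the discarded lanes correspond to the halo that neighbouring warps would have to cover.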
Pages: 81