A Versatile Software Systolic Execution Model for GPU Memory-Bound Kernels

Cited by: 11

Authors:

Chen, Peng [1,2]

Wahib, Mohamed [2]

Takizawa, Shinichiro [2]

Takano, Ryousei [3]

Matsuoka, Satoshi [1,4]

Affiliations:

[1] Tokyo Inst Technol, Tokyo, Japan

[2] Natl Inst Adv Ind Sci & Technol, AIST Tokyo Tech Real World Big Data Computat Open, Tsukuba, Ibaraki, Japan

[3] Natl Inst Adv Ind Sci & Technol, Tsukuba, Ibaraki, Japan

[4] RIKEN Ctr Computat Sci, Kobe, Hyogo, Japan

Source:

PROCEEDINGS OF SC19: THE INTERNATIONAL CONFERENCE FOR HIGH PERFORMANCE COMPUTING, NETWORKING, STORAGE AND ANALYSIS | 2019

Keywords:

Systolic Array; GPU; CUDA; Convolution; Stencil; OPTIMIZATION; ARRAYS; DESIGN

DOI:

10.1145/3295500.3356162

Chinese Library Classification (CLC) number:

TP301 [Theory and Methods]

Discipline classification code:

081202

Abstract:

This paper proposes a versatile high-performance execution model, inspired by systolic arrays, for memory-bound regular kernels running on CUDA-enabled GPUs. We formulate a systolic model that shifts partial sums by CUDA warp primitives for the computation. We also employ register files as a cache resource in order to operate the entire model efficiently. We demonstrate the effectiveness and versatility of the proposed model for a wide variety of stencil kernels that appear commonly in HPC, and also convolution kernels (increasingly important in deep learning workloads). Our algorithm outperforms the top reported state-of-the-art stencil implementations, including implementations with sophisticated temporal and spatial blocking techniques, on the two latest Nvidia architectures: Tesla V100 and P100. For 2D convolution of general filter sizes and shapes, our algorithm is on average 2.5x faster than Nvidia's NPP on V100 and P100 GPUs.
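
The partial-sum shifting described in the abstract can be illustrated with a small simulation. The sketch below is not the authors' CUDA implementation: it emulates a single 32-lane warp in plain Python, assumes the warp primitive behaves like CUDA's `__shfl_up_sync`, and uses hypothetical names (`shfl_up`, `systolic_conv1d`, `reference_conv1d`). Each lane keeps one input element in a "register", and the partial sums move one lane per step while each lane adds its own contribution.

```python
WARP_SIZE = 32  # lanes per CUDA warp


def shfl_up(values, delta):
    """Emulate a shuffle-up: lane i reads lane i - delta.

    Lanes with i < delta have no source lane and keep their own value,
    which mirrors the behaviour of CUDA's __shfl_up_sync.
    """
    return [values[i - delta] if i >= delta else values[i]
            for i in range(len(values))]


def systolic_conv1d(x, w):
    """1D 'valid' convolution y[i] = sum_k w[k] * x[i + k] for one warp.

    x: one input element per lane (the per-lane register), length 32.
    w: filter coefficients (illustrative; any length up to 32).
    """
    assert len(x) == WARP_SIZE and 1 <= len(w) <= WARP_SIZE
    partial = [0.0] * WARP_SIZE
    for k, coeff in enumerate(w):
        if k > 0:
            # Shift the partial sums one lane up before accumulating.
            partial = shfl_up(partial, 1)
        # Every lane adds its own register value times the current coefficient.
        partial = [p + coeff * xi for p, xi in zip(partial, x)]
    # Lanes below len(w) - 1 never received a full set of terms; drop them.
    return partial[len(w) - 1:]


def reference_conv1d(x, w):
    """Direct evaluation of the same convolution, used only for checking."""
    return [sum(w[k] * x[i + k] for k in range(len(w)))
            for i in range(len(x) - len(w) + 1)]


if __name__ == "__main__":
    x = list(range(WARP_SIZE))
    w = [1, 2, 1]
    assert systolic_conv1d(x, w) == reference_conv1d(x, w)
```

In this sketch the input values are never moved between lanes; only the partial sums travel, so after `len(w)` steps lane `j` holds the output for position `j - len(w) + 1`. On a GPU the lists would correspond to per-thread registers, which is one way to read the abstract's use of register files as a cache.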

Pages: 81
