Optimizing GPU Memory Transactions for Convolution Operations

Cited by: 5
Authors
Lu, Gangzhao [1 ]
Zhang, Weizhe [1 ]
Wang, Zheng [2 ]
Affiliations
[1] Harbin Inst Technol, Comp Sci & Technol, Harbin, Peoples R China
[2] Univ Leeds, Sch Comp, Leeds, W Yorkshire, England
Keywords
Performance Optimization; Convolution; Memory Optimization; GPUs;
DOI
10.1109/CLUSTER49012.2020.00050
CLC Number (Chinese Library Classification)
TP3 [Computing technology, computer technology];
Subject Classification Code
0812 ;
Abstract
Convolution is a common operation in deep neural networks (DNNs) and is often responsible for performance bottlenecks during training and inference. Existing approaches for accelerating convolution aim to reduce computational complexity; however, these strategies often increase the memory footprint and introduce extra memory accesses, leaving much room for performance improvement. This paper presents a novel approach to optimizing memory accesses for convolution operations, specifically targeting GPU execution. Our approach leverages two optimization techniques to reduce the number of memory operations issued for convolutions along the width and height dimensions. For convolution along the width dimension, we exploit shuffle instructions to exchange the overlapped columns of the input, reducing the number of memory transactions. For convolution along the height dimension, we multiply each overlapped row of the input with multiple rows of a filter to compute multiple output elements, improving the data locality of row elements. We apply our approach to 2D and multi-channel 2D convolutions on an NVIDIA 2080Ti GPU. For 2D convolution, our approach delivers over 2x the performance of state-of-the-art image processing libraries. For multi-channel 2D convolution, we obtain up to 1.3x speedups over the fastest cuDNN algorithm.
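To make the width-dimension idea concrete, the following is a minimal CUDA sketch, not taken from the paper: one warp computes 32 consecutive output columns of a row, and the overlapped input columns are exchanged between lanes with warp shuffle instructions instead of being re-read from global memory. The kernel name conv_row_shuffle, the filter width K = 3, and the clamped boundary handling are illustrative assumptions.

```cuda
// Sketch of shuffle-based column reuse for convolution along the width
// dimension. Each lane loads ONE input element; neighbouring columns are
// fetched from other lanes' registers via __shfl_down_sync.
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define K    3    // filter width (assumed for this sketch)
#define WARP 32

__global__ void conv_row_shuffle(const float* __restrict__ in,
                                 float* __restrict__ out,
                                 const float* __restrict__ filt,
                                 int width)
{
    int lane = threadIdx.x & (WARP - 1);
    int col  = blockIdx.x * WARP + lane;   // output column of this thread

    // Every lane stays active so the full-warp shuffle mask is valid;
    // out-of-range lanes load a clamped element and skip the final store.
    float x   = in[min(col, width - 1)];
    float acc = 0.0f;

    for (int k = 0; k < K; ++k) {
        // Register value held by lane (lane + k): the overlapped column is
        // exchanged inside the warp rather than re-read from global memory.
        float v = __shfl_down_sync(0xffffffffu, x, k);
        int src = col + k;
        if (lane + k >= WARP || src >= width)   // warp boundary / row edge:
            v = in[min(src, width - 1)];        // fall back to a direct load
        acc += v * filt[k];
    }
    if (col < width) out[col] = acc;
}

int main() {
    const int width = 1024;
    std::vector<float> h_in(width, 1.0f), h_out(width), h_filt(K, 1.0f / K);
    float *d_in, *d_out, *d_filt;
    cudaMalloc(&d_in,   width * sizeof(float));
    cudaMalloc(&d_out,  width * sizeof(float));
    cudaMalloc(&d_filt, K * sizeof(float));
    cudaMemcpy(d_in,   h_in.data(),   width * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_filt, h_filt.data(), K * sizeof(float),     cudaMemcpyHostToDevice);
    conv_row_shuffle<<<(width + WARP - 1) / WARP, WARP>>>(d_in, d_out, d_filt, width);
    cudaMemcpy(h_out.data(), d_out, width * sizeof(float), cudaMemcpyDeviceToHost);
    printf("out[0] = %f\n", h_out[0]);
    return 0;
}
```

The height-dimension technique described in the abstract (multiplying one loaded input row against several filter rows to produce multiple output rows) would extend this pattern by keeping the row in registers and accumulating into several partial sums; that part is omitted from the sketch.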
Pages: 399-403
Page count: 5