Optimizing Depthwise Separable Convolution Operations on GPUs

Cited by: 21
Authors
Lu, Gangzhao [1 ]
Zhang, Weizhe [1 ]
Wang, Zheng [2 ]
Affiliations
[1] Harbin Inst Technol, Sch Cyberspace Sci, Harbin 150000, Peoples R China
[2] Univ Leeds, Sch Comp, Leeds LS2 9JT, W Yorkshire, England
Funding
National Natural Science Foundation of China
Keywords
Convolution; Graphics processing units; Instruction sets; Kernel; Standards; Training; Registers; Performance optimization; convolution; depthwise; pointwise; memory optimization; GPU utilization;
DOI
10.1109/TPDS.2021.3084813
CLC Number
TP301 [Theory and Methods]
Discipline Code
081202
Abstract
The depthwise separable convolution is commonly used in convolutional neural networks (CNNs) to reduce the computation overhead of a standard multi-channel 2D convolution. Existing implementations of depthwise separable convolutions target accelerating model training with large batch sizes, where a large number of samples are processed at once. Such approaches are inadequate for small-batch-size model training and for the typical inference scenario, where the model takes in only a few samples at a time. This article aims to close this gap by optimizing depthwise separable convolutions for the GPU architecture. We achieve this by designing two novel algorithms that improve the column and row reuse of the convolution operation, reducing the number of memory operations performed along the width and height dimensions of the 2D convolution. Our approach employs a dynamic tile size scheme to adaptively distribute the computational data across GPU threads, improving GPU utilization and hiding the memory access latency. We apply our approach on two GPU platforms, an NVIDIA RTX 2080Ti GPU and an embedded NVIDIA Jetson AGX Xavier GPU, and two data types, 32-bit floating point (FP32) and 8-bit integer (INT8). We compare our approach against cuDNN, which is heavily tuned for the NVIDIA GPU architecture. Experimental results show that our approach delivers over 2x (up to 3x) performance improvement over cuDNN. We also show that, with a moderate batch size, our approach reduces the end-to-end training time of MobileNet and EfficientNet by 9.7 and 7.3 percent on average, respectively, and reduces their end-to-end inference time by 12.2 and 11.6 percent, respectively.
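To make the operation the abstract refers to concrete, the following is a minimal, unoptimized CUDA sketch of the two stages of a depthwise separable convolution: a per-channel depthwise K x K convolution followed by a 1 x 1 pointwise convolution. The kernel names, the single-sample CHW data layout, stride 1, and the 'valid' (no-padding) output size are illustrative assumptions; these are not the authors' optimized kernels, which additionally exploit column/row reuse and a dynamic tile size scheme to cut memory operations and improve GPU utilization.

#include <cuda_runtime.h>

// Depthwise stage: each input channel is convolved with its own K x K filter.
// One thread produces one output element; the K x K windows read by
// neighbouring threads overlap heavily, which is the row/column reuse
// opportunity the paper targets (left unexploited here for clarity).
__global__ void depthwise_conv2d(const float* __restrict__ in,    // C x H x W
                                 const float* __restrict__ filt,  // C x K x K
                                 float* __restrict__ out,         // C x outH x outW
                                 int C, int H, int W, int K) {
    int c = blockIdx.z;                               // channel index
    int y = blockIdx.y * blockDim.y + threadIdx.y;    // output row
    int x = blockIdx.x * blockDim.x + threadIdx.x;    // output column
    int outH = H - K + 1, outW = W - K + 1;           // 'valid' convolution, stride 1
    if (c >= C || y >= outH || x >= outW) return;
    float acc = 0.0f;
    for (int i = 0; i < K; ++i)
        for (int j = 0; j < K; ++j)
            acc += in[(c * H + y + i) * W + (x + j)] * filt[(c * K + i) * K + j];
    out[(c * outH + y) * outW + x] = acc;
}

// Pointwise stage: a 1 x 1 convolution mixing the C depthwise outputs into
// M output channels; H and W here are the spatial dimensions of the
// depthwise output.
__global__ void pointwise_conv2d(const float* __restrict__ in,  // C x H x W
                                 const float* __restrict__ w,   // M x C
                                 float* __restrict__ out,       // M x H x W
                                 int C, int M, int H, int W) {
    int m = blockIdx.z;                               // output channel index
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    if (m >= M || y >= H || x >= W) return;
    float acc = 0.0f;
    for (int c = 0; c < C; ++c)
        acc += in[(c * H + y) * W + x] * w[m * C + c];
    out[(m * H + y) * W + x] = acc;
}

For an H x W input with C input channels, M output channels, and a K x K filter, this factorization costs roughly H*W*C*K^2 + H*W*C*M multiply-accumulates versus H*W*C*M*K^2 for a standard convolution, i.e., about a 1/M + 1/K^2 fraction of the work, which is why depthwise separable layers dominate networks such as MobileNet and EfficientNet.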
Pages: 70-87
Number of pages: 18