Optimizing Depthwise Separable Convolution Operations on GPUs

Cited by: 21
Authors
Lu, Gangzhao [1 ]
Zhang, Weizhe [1 ]
Wang, Zheng [2 ]
Affiliations
[1] Harbin Inst Technol, Sch Cyberspace Sci, Harbin 150000, Peoples R China
[2] Univ Leeds, Sch Comp, Leeds LS2 9JT, W Yorkshire, England
Funding
National Natural Science Foundation of China;
Keywords
Convolution; Graphics processing units; Instruction sets; Kernel; Standards; Training; Registers; Performance optimization; convolution; depthwise; pointwise; memory optimization; GPU utilization;
DOI
10.1109/TPDS.2021.3084813
CLC number
TP301 [Theory, Methods];
Subject classification code
081202;
Abstract
The depthwise separable convolution is commonly seen in convolutional neural networks (CNNs), and is widely used to reduce the computation overhead of a standard multi-channel 2D convolution. Existing implementations of depthwise separable convolutions target accelerating model training with large batch sizes, where a large number of samples are processed at once. Such approaches are inadequate for small-batch-sized model training and for the typical model-inference scenario, where the model takes in only a few samples at a time. This article aims to bridge this gap by optimizing depthwise separable convolutions for the GPU architecture. We achieve this by designing two novel algorithms that improve the column and row reuse of the convolution operation, reducing the number of memory operations performed along the width and height dimensions of the 2D convolution. Our approach employs a dynamic tile size scheme to adaptively distribute the computational data across GPU threads, improving GPU utilization and hiding memory access latency. We apply our approach on two GPU platforms, an NVIDIA RTX 2080Ti GPU and an embedded NVIDIA Jetson AGX Xavier GPU, and two data types, 32-bit floating point (FP32) and 8-bit integer (INT8). We compare our approach against cuDNN, which is heavily tuned for the NVIDIA GPU architecture. Experimental results show that our approach delivers over 2x (up to 3x) performance improvement over cuDNN. We show that, when using a moderate batch size, our approach reduces the end-to-end training time of MobileNet and EfficientNet by 9.7 and 7.3 percent on average, respectively, and reduces their end-to-end inference time by 12.2 and 11.6 percent, respectively.
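The abstract describes factoring a standard multi-channel 2D convolution into a depthwise stage (one spatial filter per channel) and a pointwise (1x1) stage that mixes channels. A minimal NumPy sketch of that factorization follows; it is an illustration of the operation only, not the paper's optimized GPU kernels, and all function and variable names are ours:

```python
import numpy as np

def depthwise_separable_conv(x, dw_filters, pw_filters):
    """Depthwise separable convolution on a single image (illustrative sketch).

    x:          (H, W, C) input
    dw_filters: (k, k, C) one spatial filter per input channel
    pw_filters: (C, M)    1x1 filters mixing C channels into M outputs
    Valid padding, stride 1.
    """
    H, W, C = x.shape
    k = dw_filters.shape[0]
    Ho, Wo = H - k + 1, W - k + 1

    # Depthwise stage: each input channel is convolved independently
    # with its own k x k filter.
    dw_out = np.zeros((Ho, Wo, C))
    for i in range(Ho):
        for j in range(Wo):
            patch = x[i:i + k, j:j + k, :]                # (k, k, C)
            dw_out[i, j, :] = np.sum(patch * dw_filters, axis=(0, 1))

    # Pointwise stage: a 1x1 convolution mixes the C channels into M outputs.
    return dw_out @ pw_filters                            # (Ho, Wo, M)

# Per output position, a standard conv costs k*k*C*M multiply-adds,
# while the separable form costs k*k*C + C*M -- the overhead reduction
# the abstract refers to.
```

The nested Python loops make the column/row data reuse explicit: adjacent output positions read overlapping k x k patches, which is exactly the reuse the paper's GPU algorithms exploit to cut memory operations.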
Pages: 70-87
Number of pages: 18
Related papers
50 records in total
  • [1] Optimizing depthwise separable convolution on DCU
    Zheng Liu
    Meng Hao
    Weizhe Zhang
    Gangzhao Lu
    Xueyang Tian
    Siyu Yang
    Mingdong Xie
    Jie Dai
    Chenyu Yuan
    Desheng Wang
    Hongwei Yang
    CCF Transactions on High Performance Computing, 2024, 6 (6) : 646 - 664
  • [2] Optimizing Image Classification with Inverse Depthwise Separable Convolution for Edge Devices
    Sharma, Akshay Kumar
    Kim, Kyung Ki
    2023 20TH INTERNATIONAL SOC DESIGN CONFERENCE, ISOCC, 2023, : 211 - 212
  • [3] Optimizing convolution operations on GPUs using adaptive tiling
    van Werkhoven, Ben
    Maassen, Jason
    Bal, Henri E.
    Seinstra, Frank J.
    FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF ESCIENCE, 2014, 30 : 14 - 26
  • [4] Falcon: lightweight and accurate convolution based on depthwise separable convolution
    Jang, Jun-Gi
    Quan, Chun
    Lee, Hyun Dong
    Kang, U.
    KNOWLEDGE AND INFORMATION SYSTEMS, 2023, 65 (05) : 2225 - 2249
  • [5] Falcon: lightweight and accurate convolution based on depthwise separable convolution
    Jun-Gi Jang
    Chun Quan
    Hyun Dong Lee
    U. Kang
    Knowledge and Information Systems, 2023, 65 : 2225 - 2249
  • [6] A Depthwise Separable Convolution Architecture for CNN Accelerator
    Srivastava, Harsh
    Sarawadekar, Kishor
    PROCEEDINGS OF 2020 IEEE APPLIED SIGNAL PROCESSING CONFERENCE (ASPCON 2020), 2020, : 1 - 5
  • [7] Fast Depthwise Separable Convolution for Embedded Systems
    Yoo, Byeongheon
    Choi, Yongjun
    Choi, Heeyoul
    NEURAL INFORMATION PROCESSING (ICONIP 2018), PT VII, 2018, 11307 : 656 - 665
  • [8] Mobile-X: Dedicated FPGA Implementation of the MobileNet Accelerator Optimizing Depthwise Separable Convolution
    Hong, Hyeonseok
    Choi, Dahun
    Kim, Namjoon
    Kim, Hyun
    IEEE Transactions on Circuits and Systems II: Express Briefs, 2024, 71 (11) : 4668 - 4672
  • [9] Load Prediction Based on Depthwise Separable Convolution Model
    Zhang, Kui
    Zhai, Suwei
    Lu, Hai
    2021 4TH INTERNATIONAL CONFERENCE ON MECHATRONICS, ROBOTICS AND AUTOMATION (ICMRA 2021), 2021, : 75 - 79
  • [10] A CNN Accelerator on FPGA Using Depthwise Separable Convolution
    Bai, Lin
    Zhao, Yiming
    Huang, Xinming
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS II-EXPRESS BRIEFS, 2018, 65 (10) : 1415 - 1419