Duplo: Lifting Redundant Memory Accesses of Deep Neural Networks for GPU Tensor Cores

Cited by: 14
Authors
Kim, Hyeonjin [1 ]
Ahn, Sungwoo [1 ]
Oh, Yunho [2 ]
Kim, Bogil [1 ]
Ro, Won Woo [1 ]
Song, William J. [1 ]
Affiliations
[1] Yonsei Univ, Sch Elect & Elect Engn, Seoul, South Korea
[2] Ecole Polytech Fed Lausanne EPFL, EcoCloud, Lausanne, Vaud, Switzerland
Keywords
Deep Neural Network; GPU; Tensor Core;
DOI
10.1109/MICRO50266.2020.00065
Chinese Library Classification
TP3 [Computing Technology, Computer Technology]
Discipline Code
0812
Abstract
This paper introduces a GPU architecture named Duplo that minimizes redundant memory accesses of convolutions in deep neural networks (DNNs). Convolution is one of the fundamental operations used in various classes of DNNs, and it accounts for the majority of execution time. Various approaches have been proposed to accelerate convolutions via general matrix multiplication (GEMM), Winograd convolution, fast Fourier transform (FFT), etc. The recent introduction of tensor cores in NVIDIA GPUs specifically targets accelerating neural network computations. A tensor core in a streaming multiprocessor (SM) is a specialized unit dedicated to handling matrix-multiply-and-accumulate (MMA) operations. The underlying operations of tensor cores are GEMM calculations, and lowering a convolution can effectively exploit the tensor cores by transforming deeply nested convolution loops into matrix multiplication. However, lowering the convolution has a critical drawback: it requires a larger memory space (or workspace) to compute the matrix multiplication, and the expanded workspace inevitably creates multiple duplicates of the same data stored at different memory addresses. The proposed Duplo architecture tackles this challenge by leveraging compile-time information and microarchitectural support to detect and eliminate redundant memory accesses that repeatedly load duplicated data in the workspace matrix. Duplo identifies data duplication based on memory addresses and convolution information generated by a compiler. It uses a load history buffer (LHB) to trace the recent load history of workspace data and their presence in the register file. Every load instruction of workspace data consults the LHB to determine whether copies of the same data may already reside in the register file. If duplicates are found, Duplo simply renames registers to point to the ones containing the same values instead of issuing memory requests to load the same data. Our experimental results show that Duplo improves the performance of DNNs by 29.4% on average and saves 34.1% of energy using tensor cores.
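To see why lowering a convolution duplicates data, the following is a minimal single-channel im2col sketch (an illustration only, not code from the paper; the input size, filter size, and unit stride are assumed for the example). Each sliding window becomes one row of the workspace matrix, so overlapping windows copy the same input element to several distinct workspace addresses:

    #include <cstdio>
    #include <vector>

    // Minimal single-channel im2col: lowers an H x W input into a workspace
    // matrix with one row per K x K sliding window (stride 1, no padding).
    // Overlapping windows copy the same input element into several rows,
    // which is exactly the duplication Duplo detects and elides.
    std::vector<std::vector<float>> im2col(const std::vector<float>& in,
                                           int H, int W, int K) {
        int outH = H - K + 1, outW = W - K + 1;
        std::vector<std::vector<float>> ws;
        for (int oy = 0; oy < outH; ++oy)
            for (int ox = 0; ox < outW; ++ox) {
                std::vector<float> row;
                for (int ky = 0; ky < K; ++ky)
                    for (int kx = 0; kx < K; ++kx)
                        row.push_back(in[(oy + ky) * W + (ox + kx)]);
                ws.push_back(row);  // each row multiplies the flattened filter
            }
        return ws;
    }

    int main() {
        // 4x4 input, 3x3 window: 16 unique elements expand to
        // 4 windows x 9 = 36 workspace entries, so most loads are duplicates.
        int H = 4, W = 4, K = 3;
        std::vector<float> in(H * W);
        for (int i = 0; i < H * W; ++i) in[i] = float(i);
        auto ws = im2col(in, H, W, K);
        printf("unique elements: %d, workspace entries: %zu\n",
               H * W, ws.size() * ws[0].size());
        return 0;
    }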
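The LHB itself is a microarchitectural structure, but its effect can be approximated in software. The sketch below is a hypothetical analogy (the struct name, fields, and address trace are assumptions for illustration, not the paper's hardware design): a table maps a workspace address to the register that already holds its value, a hit renames the destination register, and only misses issue memory requests:

    #include <cstdio>
    #include <unordered_map>

    // Software analogy of the load history buffer: remember which register
    // already holds the value at a given workspace address. A hit renames the
    // destination register instead of issuing a memory load.
    struct LHB {
        std::unordered_map<long, int> addr_to_reg;  // address -> register
        int loads = 0, renames = 0;

        int load(long addr, int dst_reg) {
            auto it = addr_to_reg.find(addr);
            if (it != addr_to_reg.end()) {  // duplicate: reuse the register
                ++renames;
                return it->second;          // dst_reg renamed to this register
            }
            ++loads;                        // miss: issue the memory request
            addr_to_reg[addr] = dst_reg;
            return dst_reg;
        }
    };

    int main() {
        LHB lhb;
        // Two overlapping im2col rows load addresses {0,1,2} and {1,2,3}:
        long rows[2][3] = {{0, 1, 2}, {1, 2, 3}};
        int reg = 0;
        for (auto& row : rows)
            for (long a : row) lhb.load(a, reg++);
        printf("memory loads: %d, register renames: %d\n",
               lhb.loads, lhb.renames);
        return 0;
    }

For the two overlapping rows in this trace, the sketch issues four memory loads and services the remaining two by renaming, mirroring how Duplo turns duplicate workspace loads into register reuse.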
Pages: 725-737
Page count: 13