Accelerating Sparse Deep Neural Network Inference Using GPU Tensor Cores

Cited by: 2
Authors
Sun, Yufei [1 ,2 ]
Zheng, Long [1 ,2 ]
Wang, Qinggang [1 ,2 ]
Ye, Xiangyu [1 ,2 ]
Huang, Yu [1 ,2 ]
Yao, Pengcheng [1 ,2 ]
Liao, Xiaofei [1 ]
Jin, Hai [1 ]
Affiliations
[1] Huazhong Univ Sci & Technol, Sch Comp Sci & Technol, Natl Engn Res Ctr Big Data Technol & Syst, Serv Comp Technol & Syst Lab,Cluster & Grid Comp, Wuhan 430074, Peoples R China
[2] Zhejiang Lab, Hangzhou 311121, Peoples R China
Keywords
SpDNN; SpMM; Tensor Cores; PRODUCT;
DOI
10.1109/HPEC55821.2022.9926300
CLC Number
TP3 [Computing technology, computer technology];
Subject Classification Code
0812 ;
Abstract
Sparse deep neural networks (SpDNNs) attract considerable research and industry attention because of their powerful learning capability, and their execution time is dominated by sparse matrix-dense matrix multiplication (SpMM). As specialized matrix-multiplication units, NVIDIA GPU Tensor Cores perform half-precision matrix-matrix multiplication at higher performance than CUDA Cores, offering great opportunities for SpMM acceleration. However, performing SpMM efficiently on Tensor Cores remains tremendously challenging. First, Tensor Cores do not handle extremely sparse matrix computations well, delivering much lower performance than on their dense counterparts. Second, the single-precision Challenge dataset prevents SpDNN inference from leveraging the powerful half-precision Tensor Cores. To this end, we first propose a similarity-based matrix transformation scheme that polarizes the weight matrix into locally denser and sparser regions; the denser and sparser workloads are then processed on Tensor Cores and CUDA Cores, respectively, boosting overall efficiency. Second, considering the half-precision limitation of Tensor Cores, we further propose a lightweight emulation algorithm that achieves single-precision computation on Tensor Cores without affecting the correctness of the final results. To the best of our knowledge, this paper is the first to accelerate SpDNN inference on Tensor Cores without compromising the precision requirement. Extensive experiments validate that our work reaches up to 300 TeraEdges per second of inference throughput on a single A100 GPU, yielding up to 89.41x and 8.12x speedups over the champions of the 2020 and 2021 Sparse Deep Neural Network Graph Challenge, respectively. Moreover, our 4-GPU version is up to 6.56x faster than the 2021 champion running on 4 GPUs and 7.55x faster than the 2020 champion running on 768 GPUs.
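The single-precision emulation idea in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation (the splitting scheme and function names here are assumptions): each fp32 operand is split into an fp16 high part plus an fp16 residual, and three half-precision products accumulated in fp32 recover near-single-precision results, modeling a Tensor Core's fp16-multiply / fp32-accumulate mode.

```python
import numpy as np

def split_fp32(x):
    """Split fp32 values into an fp16 high part and an fp16 residual,
    so that x ~= hi + lo with both parts representable in half precision."""
    hi = x.astype(np.float16)
    lo = (x - hi.astype(np.float32)).astype(np.float16)
    return hi, lo

def emulated_fp32_matmul(a, b):
    """Emulate a single-precision GEMM from half-precision inputs.

    Each fp16 x fp16 product is exact in fp32 (two 11-bit significands fit
    in 24 bits), which models a Tensor Core's fp16-multiply / fp32-accumulate.
    The tiny lo*lo cross term is dropped: its contribution is below fp32
    precision for values of this magnitude."""
    a_hi, a_lo = split_fp32(a)
    b_hi, b_lo = split_fp32(b)
    f32 = np.float32
    return (a_hi.astype(f32) @ b_hi.astype(f32)
            + a_hi.astype(f32) @ b_lo.astype(f32)
            + a_lo.astype(f32) @ b_hi.astype(f32))

rng = np.random.default_rng(0)
a = rng.standard_normal((64, 64), dtype=np.float32)
b = rng.standard_normal((64, 64), dtype=np.float32)

ref = a @ b                                                   # true fp32 GEMM
naive = (a.astype(np.float16) @ b.astype(np.float16)).astype(np.float32)
emu = emulated_fp32_matmul(a, b)

print("naive fp16 max error:", np.abs(naive - ref).max())
print("emulated  max error:", np.abs(emu - ref).max())
```

The emulated product trades one half-precision multiplication for three, which is why it only pays off when Tensor Core throughput is sufficiently higher than that of single-precision CUDA Cores.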
Pages: 7