Accelerating Sparse Deep Neural Network Inference Using GPU Tensor Cores

Cited by: 2
Authors
Sun, Yufei [1 ,2 ]
Zheng, Long [1 ,2 ]
Wang, Qinggang [1 ,2 ]
Ye, Xiangyu [1 ,2 ]
Huang, Yu [1 ,2 ]
Yao, Pengcheng [1 ,2 ]
Liao, Xiaofei [1 ]
Jin, Hai [1 ]
Affiliations
[1] Huazhong Univ Sci & Technol, Sch Comp Sci & Technol, Natl Engn Res Ctr Big Data Technol & Syst, Serv Comp Technol & Syst Lab,Cluster & Grid Comp, Wuhan 430074, Peoples R China
[2] Zhejiang Lab, Hangzhou 311121, Peoples R China
Keywords
SpDNN; SpMM; Tensor Cores; PRODUCT;
DOI
10.1109/HPEC55821.2022.9926300
CLC Number
TP3 [Computing technology, computer technology];
Subject Classification Code
0812 ;
Abstract
Sparse deep neural networks (SpDNNs) attract considerable research and industry attention because of their powerful learning capability, and their execution time is dominated by sparse matrix-dense matrix multiplication (SpMM). As specialized matrix-multiplication units, NVIDIA GPU Tensor Cores perform half-precision matrix-matrix multiplication at higher performance than CUDA Cores, offering great opportunities for SpMM acceleration. However, performing SpMM efficiently on Tensor Cores remains tremendously challenging. First, Tensor Cores do not handle extremely sparse matrix computations well, delivering much lower performance than on their dense counterparts. Second, the single-precision Challenge dataset prevents SpDNN inference from leveraging the powerful half-precision Tensor Cores. To this end, we first propose a similarity-based matrix transformation scheme that polarizes the weight matrix into locally denser and sparser regions; the denser and sparser workloads are then processed on Tensor Cores and CUDA Cores, respectively, boosting overall efficiency. Second, considering the half-precision limitation of Tensor Cores, we further propose a lightweight emulation algorithm that achieves single-precision computation on Tensor Cores without affecting the correctness of the final results. To the best of our knowledge, this paper is the first to accelerate SpDNN inference on Tensor Cores without compromising the precision requirement. Extensive experiments validate that our work reaches up to 300 TeraEdges per second of inference throughput on a single A100 GPU, yielding up to 89.41x and 8.12x speedups over the champions of the 2020 and 2021 Sparse Deep Neural Network Graph Challenge, respectively. Moreover, our 4-GPU version is up to 6.56x faster than the 2021 champion running on 4 GPUs and 7.55x faster than the 2020 champion running on 768 GPUs.
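The single-precision emulation idea in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation (the splitting scheme and function names here are assumptions): each fp32 operand is split into an fp16 high part plus an fp16 residual, and three half-precision products accumulated in fp32 recover near-single-precision results, modeling a Tensor Core's fp16-multiply / fp32-accumulate mode.

```python
import numpy as np

def split_fp32(x):
    """Split fp32 values into an fp16 high part and an fp16 residual,
    so that x ~= hi + lo with both parts representable in half precision."""
    hi = x.astype(np.float16)
    lo = (x - hi.astype(np.float32)).astype(np.float16)
    return hi, lo

def emulated_fp32_matmul(a, b):
    """Emulate a single-precision GEMM from half-precision inputs.

    Each fp16 x fp16 product is exact in fp32 (two 11-bit significands fit
    in 24 bits), which models a Tensor Core's fp16-multiply / fp32-accumulate.
    The tiny lo*lo cross term is dropped: its contribution is below fp32
    precision for values of this magnitude."""
    a_hi, a_lo = split_fp32(a)
    b_hi, b_lo = split_fp32(b)
    f32 = np.float32
    return (a_hi.astype(f32) @ b_hi.astype(f32)
            + a_hi.astype(f32) @ b_lo.astype(f32)
            + a_lo.astype(f32) @ b_hi.astype(f32))

rng = np.random.default_rng(0)
a = rng.standard_normal((64, 64), dtype=np.float32)
b = rng.standard_normal((64, 64), dtype=np.float32)

ref = a @ b                                                   # true fp32 GEMM
naive = (a.astype(np.float16) @ b.astype(np.float16)).astype(np.float32)
emu = emulated_fp32_matmul(a, b)

print("naive fp16 max error:", np.abs(naive - ref).max())
print("emulated  max error:", np.abs(emu - ref).max())
```

The emulated product trades one half-precision multiplication for three, which is why it only pays off when Tensor Core throughput is sufficiently higher than that of single-precision CUDA Cores.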
Pages: 7