Efficient Sparse-Dense Matrix-Matrix Multiplication on GPUs Using the Customized Sparse Storage Format

Cited by: 8

Authors:

Shi, Shaohuai [1]

Wang, Qiang [2]

Chu, Xiaowen [2]

Affiliations:

[1] Hong Kong Univ Sci & Technol, Dept Comp Sci & Engn, Hong Kong, Peoples R China

[2] Hong Kong Baptist Univ, Dept Comp Sci, Hong Kong, Peoples R China

Source:

2020 IEEE 26TH INTERNATIONAL CONFERENCE ON PARALLEL AND DISTRIBUTED SYSTEMS (ICPADS) | 2020

Keywords:

Sparse Matrix Multiplication; COO; GCOO; GPU;

DOI:

10.1109/ICPADS51040.2020.00013

Chinese Library Classification (CLC) number:

TP3 [Computing technology, computer technology];

Discipline classification code:

0812;

Abstract:

Multiplication of a sparse matrix by a dense matrix (SpDM) is widely used in areas such as scientific computing and machine learning. However, existing work overlooks the performance optimization of SpDM on modern many-core architectures like GPUs. Sparse storage formats keep sparse matrices compact in memory, but the irregular data access they induce makes SpDM hard to optimize on modern GPUs, resulting in lower resource utilization and poorer performance. In this paper, we refer to the roofline performance model of GPUs to design an efficient SpDM algorithm called GCOOSpDM, in which we exploit coalesced global memory access, fast shared memory reuse, and more operations per byte of global memory traffic. Experiments are conducted on three Nvidia GPUs (i.e., GTX 980, GTX Titan X Pascal, and Tesla P100) using a large number of matrices, including a public dataset and randomly generated matrices. Experimental results show that GCOOSpDM achieves a 1.5-8x speedup over Nvidia's cuSPARSE library on many matrices.
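For readers unfamiliar with the operation the abstract names, the following is a minimal CPU sketch of SpDM with the sparse operand held in the plain COO (coordinate) format listed in the keywords. It is not the paper's GCOOSpDM kernel and does not model the GCOO layout, coalesced access, or shared memory reuse; the function name `coo_spdm` and its parameters are hypothetical and chosen only for illustration.

```python
# Illustrative sketch only: plain COO sparse-dense multiplication on the CPU.
# This is NOT the GCOOSpDM algorithm; names and layout here are assumptions.

def coo_spdm(n_rows, rows, cols, vals, dense):
    """Return C = A @ B as a list of lists.

    A is an n_rows x K sparse matrix given as COO triplets (rows, cols, vals).
    B (`dense`) is a K x N dense matrix given as a list of row lists.
    """
    if not (len(rows) == len(cols) == len(vals)):
        raise ValueError("rows, cols and vals must have the same length")

    n_out_cols = len(dense[0]) if dense else 0
    c = [[0.0] * n_out_cols for _ in range(n_rows)]

    # Each nonzero A[r, k] scales row k of B and accumulates into row r of C.
    for r, k, v in zip(rows, cols, vals):
        b_row = dense[k]
        c_row = c[r]
        for j in range(n_out_cols):
            c_row[j] += v * b_row[j]
    return c


# A = [[1, 0, 2],
#      [0, 0, 3]] stored as three COO triplets.
rows = [0, 0, 1]
cols = [0, 2, 2]
vals = [1.0, 2.0, 3.0]
B = [[1.0, 2.0],
     [3.0, 4.0],
     [5.0, 6.0]]

C = coo_spdm(2, rows, cols, vals, B)
print(C)
```

Each nonzero touches a full row of the dense matrix, so the order in which nonzeros are visited determines the memory access pattern; this is the irregular access that the abstract identifies as the difficulty on GPUs.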

Pages: 19-26

Page count: 8
