GPU accelerated sparse matrix-vector multiplication and sparse matrix-transpose vector multiplication

Times Cited: 8
Authors
Tao, Yuan [1 ,2 ,3 ]
Deng, Yangdong [4 ]
Mu, Shuai [4 ]
Zhang, Zhenzhong [1 ,2 ]
Zhu, Mingfa [1 ,2 ]
Xiao, Limin [1 ,2 ]
Ruan, Li [1 ,2 ]
Affiliations
[1] Beihang Univ, State Key Lab Software Dev Environm, Beijing 100191, Peoples R China
[2] Beihang Univ, Sch Comp Sci & Engn, Beijing 100191, Peoples R China
[3] Jilin Normal Univ, Coll Math, Siping 136000, Jilin, Peoples R China
[4] Tsinghua Univ, Sch Software, Beijing 100084, Peoples R China
Source
CONCURRENCY AND COMPUTATION-PRACTICE & EXPERIENCE
Funding
National Natural Science Foundation of China; Beijing Natural Science Foundation;
Keywords
sparse matrix-transpose vector product; sparse matrix-vector product; compressed sparse block; CSB; compressed sparse rows; CSR; GPU;
DOI
10.1002/cpe.3415
CLC Number (Chinese Library Classification)
TP31 [Computer Software];
Discipline Code
081202; 0835;
Abstract
Many high-performance computing applications need to compute both the sparse matrix-vector product (SMVP) and the sparse matrix-transpose vector product (SMTVP). In such cases, it is critical to maintain similarly high throughput for the two computing patterns while encoding the underlying sparse matrix in a single storage format. The compressed sparse block (CSB) format proposed by Buluc et al. allows both products to be computed on multi-core CPUs with nearly identical throughputs. A direct port of CSB to graphics processing units (GPUs), which have recently been recognized as a powerful general-purpose computing platform, however, turns out to be inefficient. In this work, we propose a new data structure, designated as expanded CSB (eCSB), that minimizes the throughput gap between SMVP and SMTVP computations on GPUs while at the same time enabling high computing throughput. We also use a hybrid storage format for the elements within each block, selected dynamically at runtime. Experimental results show that the proposed techniques, implemented on a Kepler GPU, deliver similar throughputs for SMVP and SMTVP, up to 13 times higher than those of the CPU-based CSB implementation. In addition, our eCSB implementation outperforms previous GPU results by up to 188% and 914% for SMVP and SMTVP, respectively. We validate the effectiveness of eCSB through the wall-clock time of the bi-conjugate gradient algorithm, where eCSB runs 25% faster than compressed sparse rows (CSR) and 6% faster than HYB. Copyright (C) 2014 John Wiley & Sons, Ltd.
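To make the SMVP/SMTVP asymmetry concrete, the following CUDA sketch implements both products over one CSR-encoded matrix. This is an illustration under assumed names (csr_smvp, csr_smtvp_atomic, and the toy matrix are ours), not the authors' eCSB code. SMVP maps one thread per row and writes each output entry exactly once; SMTVP with the same arrays must scatter every nonzero into y[cols[k]] through atomicAdd, serializing colliding updates. This write contention is the throughput gap that block-oriented formats such as CSB and eCSB are designed to close.

    // CSR-based SMVP vs. SMTVP: a minimal, self-contained illustration.
    #include <cstdio>
    #include <cuda_runtime.h>

    // y = A * x: one thread per row; each output entry is written exactly once.
    __global__ void csr_smvp(int n_rows, const int *row_ptr, const int *cols,
                             const float *vals, const float *x, float *y) {
        int row = blockIdx.x * blockDim.x + threadIdx.x;
        if (row >= n_rows) return;
        float sum = 0.0f;
        for (int k = row_ptr[row]; k < row_ptr[row + 1]; ++k)
            sum += vals[k] * x[cols[k]];
        y[row] = sum;
    }

    // y = A^T * x using the same CSR arrays: every nonzero scatters into
    // y[cols[k]], so threads collide on outputs and must serialize via atomics.
    __global__ void csr_smtvp_atomic(int n_rows, const int *row_ptr,
                                     const int *cols, const float *vals,
                                     const float *x, float *y) {
        int row = blockIdx.x * blockDim.x + threadIdx.x;
        if (row >= n_rows) return;
        for (int k = row_ptr[row]; k < row_ptr[row + 1]; ++k)
            atomicAdd(&y[cols[k]], vals[k] * x[row]);
    }

    int main() {
        // Toy matrix A = [[1 0 2], [0 3 0]] in CSR.
        const int n_rows = 2, n_cols = 3;
        const int   h_row_ptr[] = {0, 2, 3};
        const int   h_cols[]    = {0, 2, 1};
        const float h_vals[]    = {1.f, 2.f, 3.f};
        const float h_x[]       = {1.f, 1.f, 1.f};  // long enough for both products

        int *row_ptr, *cols;
        float *vals, *x, *y;
        cudaMalloc(&row_ptr, sizeof h_row_ptr);
        cudaMalloc(&cols,    sizeof h_cols);
        cudaMalloc(&vals,    sizeof h_vals);
        cudaMalloc(&x,       sizeof h_x);
        cudaMalloc(&y,       n_cols * sizeof(float));
        cudaMemcpy(row_ptr, h_row_ptr, sizeof h_row_ptr, cudaMemcpyHostToDevice);
        cudaMemcpy(cols,    h_cols,    sizeof h_cols,    cudaMemcpyHostToDevice);
        cudaMemcpy(vals,    h_vals,    sizeof h_vals,    cudaMemcpyHostToDevice);
        cudaMemcpy(x,       h_x,       sizeof h_x,       cudaMemcpyHostToDevice);

        float h_y[3];
        csr_smvp<<<1, 32>>>(n_rows, row_ptr, cols, vals, x, y);
        cudaMemcpy(h_y, y, n_rows * sizeof(float), cudaMemcpyDeviceToHost);
        printf("A   x = [%g %g]\n", h_y[0], h_y[1]);             // expect [3 3]

        cudaMemset(y, 0, n_cols * sizeof(float));  // atomics accumulate into y
        csr_smtvp_atomic<<<1, 32>>>(n_rows, row_ptr, cols, vals, x, y);
        cudaMemcpy(h_y, y, n_cols * sizeof(float), cudaMemcpyDeviceToHost);
        printf("A^T x = [%g %g %g]\n", h_y[0], h_y[1], h_y[2]);  // expect [1 3 2]
        return 0;  // (error checks and cudaFree omitted for brevity)
    }

In an iterative solver such as the bi-conjugate gradient method used for validation above, both A*x and A^T*x are computed every iteration, so a format that serves only one pattern well either slows each iteration or forces the matrix to be stored twice.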
Pages: 3771-3789
Page count: 19
Related Papers
50 records in total
  • [1] Parallel Sparse Matrix-Vector and Matrix-Transpose-Vector Multiplication Using Compressed Sparse Blocks
    Buluc, Aydin
    Fineman, Jeremy T.
    Frigo, Matteo
    Gilbert, John R.
    Leiserson, Charles E.
    [J]. SPAA'09: PROCEEDINGS OF THE TWENTY-FIRST ANNUAL SYMPOSIUM ON PARALLELISM IN ALGORITHMS AND ARCHITECTURES, 2009, : 233 - 244
  • [2] Implementing Sparse Matrix-Vector Multiplication with QCSR on GPU
    Zhang, Jilin
    Liu, Enyi
    Wan, Jian
    Ren, Yongjian
    Yue, Miao
    Wang, Jue
    [J]. APPLIED MATHEMATICS & INFORMATION SCIENCES, 2013, 7 (02): 473 - 482
  • [3] Energy Evaluation of Sparse Matrix-Vector Multiplication on GPU
    Benatia, Akrem
    Ji, Weixing
    Wang, Yizhuo
    Shi, Feng
    [J]. 2016 SEVENTH INTERNATIONAL GREEN AND SUSTAINABLE COMPUTING CONFERENCE (IGSC), 2016,
  • [4] A New Method of Sparse Matrix-Vector Multiplication on GPU
    Huan, Gao
    Qian, Zhang
    [J]. PROCEEDINGS OF 2012 2ND INTERNATIONAL CONFERENCE ON COMPUTER SCIENCE AND NETWORK TECHNOLOGY (ICCSNT 2012), 2012, : 954 - 958
  • [5] Adaptive diagonal sparse matrix-vector multiplication on GPU
    Gao, Jiaquan
    Xia, Yifei
    Yin, Renjie
    He, Guixia
    [J]. JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING, 2021, 157 : 287 - 302
  • [6] Atomic Reduction Based Sparse Matrix-Transpose Vector Multiplication on GPUs
    Tao, Yuan
    Deng, Yangdong
    Mu, Shuai
    Zhu, Mingfa
    Xiao, Limin
    Ruan, Li
    Huang, Zhibin
    [J]. 2014 20TH IEEE INTERNATIONAL CONFERENCE ON PARALLEL AND DISTRIBUTED SYSTEMS (ICPADS), 2014, : 987 - 992
  • [7] Sparse Matrix-Vector Multiplication on GPGPUs
    Filippone, Salvatore
    Cardellini, Valeria
    Barbieri, Davide
    Fanfarillo, Alessandro
    [J]. ACM TRANSACTIONS ON MATHEMATICAL SOFTWARE, 2017, 43 (04):
  • [8] Automatically Tuning Sparse Matrix-Vector Multiplication for GPU Architectures
    Monakov, Alexander
    Lokhmotov, Anton
    Avetisyan, Arutyun
    [J]. HIGH PERFORMANCE EMBEDDED ARCHITECTURES AND COMPILERS, PROCEEDINGS, 2010, 5952 : 111 - +
  • [9] Vector ISA extension for sparse matrix-vector multiplication
    Vassiliadis, S
    Cotofana, S
    Stathis, P
    [J]. EURO-PAR'99: PARALLEL PROCESSING, 1999, 1685 : 708 - 715
  • [10] Reducing Vector I/O for Faster GPU Sparse Matrix-Vector Multiplication
    Nguyen Quang Anh Pham
    Fan, Rui
    Wen, Yonggang
    [J]. 2015 IEEE 29TH INTERNATIONAL PARALLEL AND DISTRIBUTED PROCESSING SYMPOSIUM (IPDPS), 2015, : 1043 - 1052