GPU accelerated sparse matrix-vector multiplication and sparse matrix-transpose vector multiplication

Times Cited: 8
Authors
Tao, Yuan [1 ,2 ,3 ]
Deng, Yangdong [4 ]
Mu, Shuai [4 ]
Zhang, Zhenzhong [1 ,2 ]
Zhu, Mingfa [1 ,2 ]
Xiao, Limin [1 ,2 ]
Ruan, Li [1 ,2 ]
Affiliations
[1] Beihang Univ, State Key Lab Software Dev Environm, Beijing 100191, Peoples R China
[2] Beihang Univ, Sch Comp Sci & Engn, Beijing 100191, Peoples R China
[3] Jilin Normal Univ, Coll Math, Siping 136000, Jilin, Peoples R China
[4] Tsinghua Univ, Sch Software, Beijing 100084, Peoples R China
Source
CONCURRENCY AND COMPUTATION-PRACTICE & EXPERIENCE
Funding
National Natural Science Foundation of China; Beijing Natural Science Foundation;
Keywords
sparse matrix-transpose vector product; sparse matrix-vector product; compressed sparse block; CSB; compressed sparse rows; CSR; GPU;
DOI
10.1002/cpe.3415
CLC Number (Chinese Library Classification)
TP31 [Computer Software];
Discipline Code
081202; 0835;
Abstract
Many high-performance computing applications need to compute both the sparse matrix-vector product (SMVP) and the sparse matrix-transpose vector product (SMTVP). In such cases, it is critical to maintain similarly high throughput for the two computing patterns while encoding the underlying sparse matrix in a single storage format. The compressed sparse block (CSB) format proposed by Buluc et al. allows both products to be computed on multi-core CPUs with nearly identical throughputs. A direct port of CSB to graphics processing units (GPUs), which have recently been recognized as a powerful general-purpose computing platform, however, turns out to be inefficient. In this work, we propose a new data structure, designated as expanded CSB (eCSB), that minimizes the throughput gap between SMVP and SMTVP computations on GPUs while at the same time enabling high computing throughput. We also use a hybrid storage format for the elements within each block, selected dynamically at runtime. Experimental results show that the proposed techniques, implemented on a Kepler GPU, deliver similar throughputs for SMVP and SMTVP, up to 13 times higher than those of the CPU-based CSB implementation. In addition, our eCSB implementation outperforms previous GPU results by up to 188% and 914% for SMVP and SMTVP, respectively. We validate the effectiveness of eCSB through the wall-clock time of the bi-conjugate gradient algorithm, where eCSB runs 25% faster than compressed sparse rows (CSR) and 6% faster than HYB. Copyright (C) 2014 John Wiley & Sons, Ltd.
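To make the SMVP/SMTVP asymmetry concrete, the following CUDA sketch implements both products over one CSR-encoded matrix. This is an illustration under assumed names (csr_smvp, csr_smtvp_atomic, and the toy matrix are ours), not the authors' eCSB code. SMVP maps one thread per row and writes each output entry exactly once; SMTVP with the same arrays must scatter every nonzero into y[cols[k]] through atomicAdd, serializing colliding updates. This write contention is the throughput gap that block-oriented formats such as CSB and eCSB are designed to close.

    // CSR-based SMVP vs. SMTVP: a minimal, self-contained illustration.
    #include <cstdio>
    #include <cuda_runtime.h>

    // y = A * x: one thread per row; each output entry is written exactly once.
    __global__ void csr_smvp(int n_rows, const int *row_ptr, const int *cols,
                             const float *vals, const float *x, float *y) {
        int row = blockIdx.x * blockDim.x + threadIdx.x;
        if (row >= n_rows) return;
        float sum = 0.0f;
        for (int k = row_ptr[row]; k < row_ptr[row + 1]; ++k)
            sum += vals[k] * x[cols[k]];
        y[row] = sum;
    }

    // y = A^T * x using the same CSR arrays: every nonzero scatters into
    // y[cols[k]], so threads collide on outputs and must serialize via atomics.
    __global__ void csr_smtvp_atomic(int n_rows, const int *row_ptr,
                                     const int *cols, const float *vals,
                                     const float *x, float *y) {
        int row = blockIdx.x * blockDim.x + threadIdx.x;
        if (row >= n_rows) return;
        for (int k = row_ptr[row]; k < row_ptr[row + 1]; ++k)
            atomicAdd(&y[cols[k]], vals[k] * x[row]);
    }

    int main() {
        // Toy matrix A = [[1 0 2], [0 3 0]] in CSR.
        const int n_rows = 2, n_cols = 3;
        const int   h_row_ptr[] = {0, 2, 3};
        const int   h_cols[]    = {0, 2, 1};
        const float h_vals[]    = {1.f, 2.f, 3.f};
        const float h_x[]       = {1.f, 1.f, 1.f};  // long enough for both products

        int *row_ptr, *cols;
        float *vals, *x, *y;
        cudaMalloc(&row_ptr, sizeof h_row_ptr);
        cudaMalloc(&cols,    sizeof h_cols);
        cudaMalloc(&vals,    sizeof h_vals);
        cudaMalloc(&x,       sizeof h_x);
        cudaMalloc(&y,       n_cols * sizeof(float));
        cudaMemcpy(row_ptr, h_row_ptr, sizeof h_row_ptr, cudaMemcpyHostToDevice);
        cudaMemcpy(cols,    h_cols,    sizeof h_cols,    cudaMemcpyHostToDevice);
        cudaMemcpy(vals,    h_vals,    sizeof h_vals,    cudaMemcpyHostToDevice);
        cudaMemcpy(x,       h_x,       sizeof h_x,       cudaMemcpyHostToDevice);

        float h_y[3];
        csr_smvp<<<1, 32>>>(n_rows, row_ptr, cols, vals, x, y);
        cudaMemcpy(h_y, y, n_rows * sizeof(float), cudaMemcpyDeviceToHost);
        printf("A   x = [%g %g]\n", h_y[0], h_y[1]);             // expect [3 3]

        cudaMemset(y, 0, n_cols * sizeof(float));  // atomics accumulate into y
        csr_smtvp_atomic<<<1, 32>>>(n_rows, row_ptr, cols, vals, x, y);
        cudaMemcpy(h_y, y, n_cols * sizeof(float), cudaMemcpyDeviceToHost);
        printf("A^T x = [%g %g %g]\n", h_y[0], h_y[1], h_y[2]);  // expect [1 3 2]
        return 0;  // (error checks and cudaFree omitted for brevity)
    }

In an iterative solver such as the bi-conjugate gradient method used for validation above, both A*x and A^T*x are computed every iteration, so a format that serves only one pattern well either slows each iteration or forces the matrix to be stored twice.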
Pages: 3771-3789
Page count: 19
Related Papers
50 records in total
  • [1] Parallel Sparse Matrix-Vector and Matrix-Transpose-Vector Multiplication Using Compressed Sparse Blocks
    Buluc, Aydin
    Fineman, Jeremy T.
    Frigo, Matteo
    Gilbert, John R.
    Leiserson, Charles E.
    [J]. SPAA'09: PROCEEDINGS OF THE TWENTY-FIRST ANNUAL SYMPOSIUM ON PARALLELISM IN ALGORITHMS AND ARCHITECTURES, 2009, : 233 - 244
  • [2] Implementing Sparse Matrix-Vector Multiplication with QCSR on GPU
    Zhang, Jilin
    Liu, Enyi
    Wan, Jian
    Ren, Yongjian
    Yue, Miao
    Wang, Jue
    [J]. APPLIED MATHEMATICS & INFORMATION SCIENCES, 2013, 7 (02): 473 - 482
  • [3] Energy Evaluation of Sparse Matrix-Vector Multiplication on GPU
    Benatia, Akrem
    Ji, Weixing
    Wang, Yizhuo
    Shi, Feng
    [J]. 2016 SEVENTH INTERNATIONAL GREEN AND SUSTAINABLE COMPUTING CONFERENCE (IGSC), 2016,
  • [4] A New Method of Sparse Matrix-Vector Multiplication on GPU
    Huan, Gao
    Qian, Zhang
    [J]. PROCEEDINGS OF 2012 2ND INTERNATIONAL CONFERENCE ON COMPUTER SCIENCE AND NETWORK TECHNOLOGY (ICCSNT 2012), 2012, : 954 - 958
  • [5] Adaptive diagonal sparse matrix-vector multiplication on GPU
    Gao, Jiaquan
    Xia, Yifei
    Yin, Renjie
    He, Guixia
    [J]. JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING, 2021, 157 : 287 - 302
  • [6] Atomic Reduction Based Sparse Matrix-Transpose Vector Multiplication on GPUs
    Tao, Yuan
    Deng, Yangdong
    Mu, Shuai
    Zhu, Mingfa
    Xiao, Limin
    Ruan, Li
    Huang, Zhibin
    [J]. 2014 20TH IEEE INTERNATIONAL CONFERENCE ON PARALLEL AND DISTRIBUTED SYSTEMS (ICPADS), 2014, : 987 - 992
  • [7] Sparse Matrix-Vector Multiplication on GPGPUs
    Filippone, Salvatore
    Cardellini, Valeria
    Barbieri, Davide
    Fanfarillo, Alessandro
    [J]. ACM TRANSACTIONS ON MATHEMATICAL SOFTWARE, 2017, 43 (04):
  • [8] Automatically Tuning Sparse Matrix-Vector Multiplication for GPU Architectures
    Monakov, Alexander
    Lokhmotov, Anton
    Avetisyan, Arutyun
    [J]. HIGH PERFORMANCE EMBEDDED ARCHITECTURES AND COMPILERS, PROCEEDINGS, 2010, 5952 : 111 - +
  • [9] Vector ISA extension for sparse matrix-vector multiplication
    Vassiliadis, S
    Cotofana, S
    Stathis, P
    [J]. EURO-PAR'99: PARALLEL PROCESSING, 1999, 1685 : 708 - 715
  • [10] Reducing Vector I/O for Faster GPU Sparse Matrix-Vector Multiplication
    Nguyen Quang Anh Pham
    Fan, Rui
    Wen, Yonggang
    [J]. 2015 IEEE 29TH INTERNATIONAL PARALLEL AND DISTRIBUTED PROCESSING SYMPOSIUM (IPDPS), 2015, : 1043 - 1052