A MEMORY EFFICIENT AND FAST SPARSE MATRIX VECTOR PRODUCT ON A GPU

Cited by: 56
Authors
Dziekonski, A. [1 ]
Lamecki, A. [1 ]
Mrozowski, M. [1 ]
Affiliations
[1] Gdansk Univ Technol GUT, Fac Elect Telecommun & Informat ETI, WiComm Ctr Excellence, PL-80233 Gdansk, Poland
Keywords
FINITE-ELEMENT-METHOD; FDTD METHOD; SCATTERING; ALGORITHM; UNITS;
DOI
10.2528/PIER11031607
Chinese Library Classification
TM [Electrical Technology]; TN [Electronic Technology, Communication Technology];
Discipline Code
0808; 0809;
Abstract
This paper proposes a new sparse matrix storage format that allows an efficient implementation of the sparse matrix-vector product on a Fermi Graphics Processing Unit (GPU). Unlike previous formats, it combines a low memory footprint with high throughput. The new format, called Sliced ELLR-T, has been designed specifically to accelerate the iterative solution of the large, sparse, complex-valued systems of linear equations that arise in computational electromagnetics. Numerical tests show that the new implementation reaches 69 GFLOPS in complex single-precision arithmetic, a sixfold speedup over an optimized six-core Central Processing Unit (CPU) (Intel Xeon 5680). In terms of speed, the new format matches the fastest format published so far, yet it does not introduce the redundant zero elements that must otherwise be stored to ensure fast memory access. Compared to previously published solutions, significantly larger problems can therefore be handled on low-cost commodity GPUs with a limited amount of on-board memory.
Pages: 49-63
Page count: 15
Related papers
50 records in total
  • [1] Fast Sparse Matrix and Sparse Vector Multiplication Algorithm on the GPU
    Yang, Carl
    Wang, Yangzihao
    Owens, John D.
    2015 IEEE 29TH INTERNATIONAL PARALLEL AND DISTRIBUTED PROCESSING SYMPOSIUM WORKSHOPS, 2015, : 841 - 847
  • [2] A Fast Sparse Block Circulant Matrix Vector Product
    Romero, Eloy
    Tomas, Andres
    Soriano, Antonio
    Blanquer, Ignacio
    EURO-PAR 2014 PARALLEL PROCESSING, 2014, 8632 : 548 - 559
  • [3] Efficient CSR-Based Sparse Matrix-Vector Multiplication on GPU
    Gao, Jiaquan
    Qi, Panpan
    He, Guixia
    MATHEMATICAL PROBLEMS IN ENGINEERING, 2016, 2016
  • [4] A GPU Framework for Sparse Matrix Vector Multiplication
    Neelima, B.
    Reddy, G. Ram Mohana
    Raghavendra, Prakash S.
    2014 IEEE 13TH INTERNATIONAL SYMPOSIUM ON PARALLEL AND DISTRIBUTED COMPUTING (ISPDC), 2014, : 51 - 58
  • [5] Performance evaluation of sparse matrix-vector product (SpMV) computation on GPU architecture
    Kasmi, Najlae
    Mahmoudi, Sidi Ahmed
    Zbakh, Mostapha
    Manneback, Pierre
    2014 SECOND WORLD CONFERENCE ON COMPLEX SYSTEMS (WCCS), 2014, : 23 - 27
  • [6] An Efficient Sparse Matrix Multiplication for skewed matrix on GPU
    Shah, Monika
    Patel, Vibha
    2012 IEEE 14TH INTERNATIONAL CONFERENCE ON HIGH PERFORMANCE COMPUTING AND COMMUNICATIONS & 2012 IEEE 9TH INTERNATIONAL CONFERENCE ON EMBEDDED SOFTWARE AND SYSTEMS (HPCC-ICESS), 2012, : 1301 - 1306
  • [7] Efficient Sparse-Matrix Multi-Vector Product on GPUs
    Hong, Changwan
    Sukumaran-Rajam, Aravind
    Bandyopadhyay, Bortik
    Kim, Jinsung
    Kurt, Sureyya Emre
    Nisa, Israt
    Sabhlok, Shivani
    Catalyurek, Umit V.
    Parthasarathy, Srinivasan
    Sadayappan, P.
    HPDC '18: PROCEEDINGS OF THE 27TH INTERNATIONAL SYMPOSIUM ON HIGH-PERFORMANCE PARALLEL AND DISTRIBUTED COMPUTING, 2018, : 66 - 79
  • [8] GPU accelerated sparse matrix-vector multiplication and sparse matrix-transpose vector multiplication
    Tao, Yuan
    Deng, Yangdong
    Mu, Shuai
    Zhang, Zhenzhong
    Zhu, Mingfa
    Xiao, Limin
    Ruan, Li
    CONCURRENCY AND COMPUTATION-PRACTICE & EXPERIENCE, 2015, 27 (14): : 3771 - 3789
  • [9] Generating optimal CUDA sparse matrix-vector product implementations for evolving GPU hardware
    El Zein, Ahmed H.
    Rendell, Alistair P.
    CONCURRENCY AND COMPUTATION-PRACTICE & EXPERIENCE, 2012, 24 (01): : 3 - 13
  • [10] Improving the locality of the sparse matrix-vector product on shared memory multiprocessors
    Pichel, JC
    Heras, DB
    Cabaleiro, JC
    Rivera, FF
    12TH EUROMICRO CONFERENCE ON PARALLEL, DISTRIBUTED AND NETWORK-BASED PROCESSING, PROCEEDINGS, 2004, : 66 - 71