Reducing Vector I/O for Faster GPU Sparse Matrix-Vector Multiplication

Cited by: 5
Authors
Nguyen Quang Anh Pham [1]
Fan, Rui [1]
Wen, Yonggang [1]
Affiliations
[1] Nanyang Technol Univ, Sch Comp Engn, Singapore, Singapore
DOI
10.1109/IPDPS.2015.100
Chinese Library Classification number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Sparse matrix-vector multiplication (SpMV) is an important kernel used in solving many scientific and engineering problems. The massive parallelism of graphics processing units (GPUs) makes them well suited for SpMV computations. However, fully utilizing the power of GPUs is challenging because SpMV makes a large number of scattered memory accesses which saturate the GPU's memory bandwidth. Most previous works sought to address the bandwidth limitation by using efficient storage formats for the matrix. However, we show that for most matrices, a majority of the bandwidth is consumed by accesses to the vector. In this paper, we introduce two techniques to significantly decrease the I/O for vector accesses, by making novel use of the GPU's fast shared memory. A key advantage of our vector optimizations is that they are complementary to existing matrix I/O optimizations, so that it is possible to use both techniques in conjunction. Furthermore, combining the optimizations requires only minor code changes. We demonstrate how to combine our techniques with the widely used CUSP SpMV algorithm and the currently highest performing yaSpMV algorithm to significantly improve both algorithms' performance. We experimented with a wide range of matrices, and show that the modified version of CUSP on average reduces vector I/O by 37% and total I/O by 31%, while the modified version of yaSpMV reduces vector and total I/O by 36% and 31%, respectively. We improve CUSP's total throughput by 14% on average and up to 77% for certain matrices, and improve yaSpMV's throughput by 12% on average and 35% for some matrices.
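The abstract's observation that vector accesses can consume most of the bandwidth can be illustrated with a toy traffic model. The sketch below is not the paper's technique or its measurement method. It assumes a CSR matrix, 4-byte values and indices, and a 32-byte memory transaction size, and it charges one transaction per distinct segment of x touched by each group of 32 consecutive nonzeros. The function names `csr_spmv` and `estimate_io_bytes` and all parameters are hypothetical.

```python
def csr_spmv(row_ptr, col_idx, vals, x):
    """Reference CSR SpMV: returns y = A @ x as a list of floats."""
    y = [0.0] * (len(row_ptr) - 1)
    for row in range(len(y)):
        for k in range(row_ptr[row], row_ptr[row + 1]):
            y[row] += vals[k] * x[col_idx[k]]  # scattered read of x
    return y


def estimate_io_bytes(col_idx, warp=32, segment_bytes=32,
                      value_bytes=4, index_bytes=4):
    """Toy model (an assumption for illustration only).

    Matrix values and column indices are read contiguously, so they are
    charged exactly their size (row pointers are ignored for simplicity).
    Vector reads are grouped into chunks of `warp` consecutive nonzeros;
    each chunk pays one `segment_bytes` transaction per distinct segment
    of x that it touches.
    """
    matrix_bytes = len(col_idx) * (value_bytes + index_bytes)
    per_segment = segment_bytes // value_bytes  # x entries per transaction
    vector_bytes = 0
    for start in range(0, len(col_idx), warp):
        chunk = col_idx[start:start + warp]
        segments = {c // per_segment for c in chunk}
        vector_bytes += len(segments) * segment_bytes
    return matrix_bytes, vector_bytes


# 3x3 example: A = [[1, 0, 2], [0, 3, 0], [4, 0, 5]]
y = csr_spmv([0, 2, 3, 5], [0, 2, 1, 0, 2],
             [1.0, 2.0, 3.0, 4.0, 5.0], [1.0, 1.0, 1.0])

# Scattered columns: each read of x lands in a different segment.
scattered = estimate_io_bytes([0, 8, 16, 24], warp=4)   # (32, 128)
# Clustered columns: all reads of x share one segment.
clustered = estimate_io_bytes([0, 1, 2, 3], warp=4)     # (32, 32)
```

Under these assumptions the scattered pattern moves four times as many bytes for the vector as for the matrix, while the clustered pattern moves the same amount for each. This is the kind of gap that motivates reducing vector I/O in addition to compressing the matrix format.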
Pages: 1043-1052
Number of pages: 10