A model-driven blocking strategy for load balanced sparse matrix-vector multiplication on GPUs

Cited by: 11
Authors
Ashari, Arash [1 ]
Sedaghati, Naser [1 ]
Eisenlohr, John [1 ]
Sadayappan, P. [1 ]
Affiliation
[1] Ohio State Univ, Dept Comp Sci & Engn, Columbus, OH 43210 USA
Keywords
SpMV; GPU; CUDA
DOI
10.1016/j.jpdc.2014.11.001
CLC number
TP301 [Theory, Methods]
Discipline code
081202
Abstract
Sparse Matrix-Vector multiplication (SpMV) is one of the key operations in linear algebra. Overcoming thread divergence, load imbalance, and uncoalesced and indirect memory access caused by sparsity and irregularity is central to optimizing SpMV on GPUs. In this paper we present a new Blocked Row-Column (BRC) storage format with a two-dimensional blocking mechanism that addresses these challenges effectively. It reduces thread divergence by reordering rows of the input matrix and blocking rows with nearly equal numbers of non-zero elements onto the same execution units (i.e., warps). BRC improves load balance by partitioning rows into blocks with a constant number of non-zeros, so that different warps perform the same amount of work. We also present an approach to optimizing BRC performance through judicious selection of the block size based on the sparsity characteristics of the matrix. A CUDA implementation of BRC outperforms the NVIDIA CUSP and cuSPARSE libraries and other state-of-the-art SpMV formats on a range of unstructured sparse matrices from multiple application domains. The BRC format has been integrated with PETSc, enabling its use in PETSc's solvers. Furthermore, when the input matrix is partitioned, BRC achieves near-linear speedup on multiple GPUs. (C) 2014 Elsevier Inc. All rights reserved.
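The two-dimensional blocking the abstract describes can be illustrated with a small, hypothetical preprocessing sketch: rows are sorted by non-zero count so that rows of similar length land in the same warp-sized group, and each group is then split column-wise into tiles holding a fixed number of non-zeros per row. The function name `brc_blocks` and the toy parameters below are illustrative assumptions, not the paper's actual implementation.

```python
def brc_blocks(row_nnz, warp_size=4, block_width=2):
    """Sketch of BRC-style 2D blocking.

    row_nnz: number of non-zeros in each matrix row (e.g. from CSR row
    pointers). Returns (row_group, column_offset) pairs, one per tile.
    """
    # Row blocking: sort row indices by non-zero count, descending, so
    # rows grouped into the same warp have similar lengths (reduces
    # thread divergence within a warp).
    order = sorted(range(len(row_nnz)), key=lambda r: -row_nnz[r])
    blocks = []
    for start in range(0, len(order), warp_size):
        rows = order[start:start + warp_size]
        # Column blocking: split the longest row in the group into
        # tiles of at most block_width non-zeros each; in the real
        # format shorter rows are zero-padded, so every warp processes
        # the same amount of work per tile.
        longest = row_nnz[rows[0]]
        ntiles = -(-longest // block_width)  # ceiling division
        for t in range(ntiles):
            blocks.append((rows, t * block_width))
    return blocks
```

For example, rows with non-zero counts [5, 1, 3, 2], a warp size of 2, and a block width of 2 yield three tiles for the group holding the 5- and 3-element rows and one tile for the group holding the 2- and 1-element rows.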
Pages: 3-15 (13 pages)
Related papers (50 total)
  • [1] Model-driven Autotuning of Sparse Matrix-Vector Multiply on GPUs
    Choi, Jee W.
    Singh, Amik
    Vuduc, Richard W.
ACM SIGPLAN NOTICES, 2010, 45(5): 115-125
  • [2] Model-driven Autotuning of Sparse Matrix-Vector Multiply on GPUs
    Choi, Jee W.
    Singh, Amik
    Vuduc, Richard W.
PPOPP 2010: PROCEEDINGS OF THE 2010 ACM SIGPLAN SYMPOSIUM ON PRINCIPLES AND PRACTICE OF PARALLEL PROGRAMMING, 2010: 115-125
  • [3] An Efficient Two-Dimensional Blocking Strategy for Sparse Matrix-Vector Multiplication on GPUs
    Ashari, Arash
    Sedaghati, Naser
    Eisenlohr, John
    Sadayappan, P.
PROCEEDINGS OF THE 28TH ACM INTERNATIONAL CONFERENCE ON SUPERCOMPUTING (ICS'14), 2014: 273-282
  • [4] Load-balanced sparse matrix-vector multiplication on parallel computers
    Nastea, SG
    Frieder, O
    El-Ghazawi, T
JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING, 1997, 46(2): 180-193
  • [5] Optimization techniques for sparse matrix-vector multiplication on GPUs
    Maggioni, Marco
    Berger-Wolf, Tanya
JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING, 2016, 93-94: 66-86
  • [6] Implementing Blocked Sparse Matrix-Vector Multiplication on NVIDIA GPUs
    Monakov, Alexander
    Avetisyan, Arutyun
EMBEDDED COMPUTER SYSTEMS: ARCHITECTURES, MODELING, AND SIMULATION, PROCEEDINGS, 2009, 5657: 289-297
  • [7] Scaleable Sparse Matrix-Vector Multiplication with Functional Memory and GPUs
    Tanabe, Noboru
    Ogawa, Yuuka
    Takata, Masami
    Joe, Kazuki
PROCEEDINGS OF THE 19TH INTERNATIONAL EUROMICRO CONFERENCE ON PARALLEL, DISTRIBUTED, AND NETWORK-BASED PROCESSING, 2011: 101-108
  • [8] Optimization of Sparse Matrix-Vector Multiplication with Variant CSR on GPUs
    Feng, Xiaowen
    Jin, Hai
    Zheng, Ran
    Hu, Kan
    Zeng, Jingxiang
    Shao, Zhiyuan
2011 IEEE 17TH INTERNATIONAL CONFERENCE ON PARALLEL AND DISTRIBUTED SYSTEMS (ICPADS), 2011: 165-172
  • [9] Multiple-precision sparse matrix-vector multiplication on GPUs
    Isupov, Konstantin
JOURNAL OF COMPUTATIONAL SCIENCE, 2022, 61
  • [10] Dense and Sparse Matrix-Vector Multiplication on Maxwell GPUs with PyCUDA
    Nurudin Alvarez, Francisco
    Antonio Ortega-Toro, Jose
    Ujaldon, Manuel
HIGH PERFORMANCE COMPUTING CARLA 2016, 2017, 697: 219-229