Block-wise dynamic mixed-precision for sparse matrix-vector multiplication on GPUs

Cited: 0
|
Authors
Zhao, Zhixiang [1 ]
Zhang, Guoyin [1 ]
Wu, Yanxia [1 ]
Hong, Ruize [1 ]
Yang, Yiqing [1 ]
Fu, Yan [1 ]
Affiliations
[1] Harbin Engn Univ, Coll Comp Sci & Technol, Harbin, Peoples R China
Source
JOURNAL OF SUPERCOMPUTING | 2024, Vol. 80, No. 10
Keywords
SpMV; GPU; Mixed-precision; Block-wise; COMPUTING METHOD; SPMV; OPTIMIZATION; FORMAT
DOI
10.1007/s11227-024-05949-6
CLC Classification Number
TP3 [Computing Technology and Computer Technology]
Discipline Code
0812
Abstract
Sparse matrix-vector multiplication (SpMV) plays a critical role in a wide range of linear algebra computations, particularly in scientific and engineering disciplines. However, the irregular memory access patterns, extensive memory usage, high bandwidth requirements, and underutilization of parallelism hinder the computational efficiency of SpMV on GPUs. In this paper, we propose a novel approach called block-wise dynamic mixed-precision (BDMP) to address these challenges. Our methodology involves partitioning the original matrix into uniformly sized blocks, with each block's size determined by considering architectural characteristics and accuracy requirements. Additionally, we dynamically assign precision to each block using a precision selection method that takes into account the value distribution of the original sparse matrix. We develop two distinct SpMV computation algorithms for BDMP: BDMP-PBP (Precision-based partitioning) and BDMP-TCKI (Tailored compression and kernel implementation). BDMP-PBP partitions the matrix into two independent matrices for separate computations based on block precision, offering flexibility for integration with other optimization techniques. Meanwhile, BDMP-TCKI focuses on achieving significant thread-level parallelism and memory utilization by tailoring an appropriate compressed storage format and kernel implementation for each block. We compare BDMP with NVIDIA's cuSPARSE library and three state-of-the-art SpMV methods, including SELLP, MergeBase, and BalanceCSR, using matrices from the University of Florida's SuiteSparse dataset collection. BDMP-PBP and BDMP-TCKI show average speedups up to 2.64× and 2.91× on Turing RTX 2080Ti, and up to 2.99× and 3.22× on Ampere A100. The results demonstrate that BDMP enables the optimization of computation speed without compromising the precision necessary for reliable results.
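The following is a minimal CUDA sketch of the precision-based partitioning idea described for BDMP-PBP: the matrix is split into two independent CSR matrices, one stored in single precision and one in double precision, each multiplied by its own kernel, with the partial results summed into one output vector. The per-entry magnitude threshold, the helper names, and the scalar one-thread-per-row kernel are illustrative assumptions, not the authors' block-wise, value-distribution-based selection method or their tuned kernels.

// bdmp_pbp_sketch.cu -- illustrative sketch only; build with: nvcc bdmp_pbp_sketch.cu
#include <cstdio>
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

// Scalar CSR SpMV, one thread per row; T selects the storage precision of the
// matrix values. Partial products accumulate into y in double precision.
template <typename T>
__global__ void csr_spmv(int nrows, const int *rowptr, const int *cols,
                         const T *vals, const double *x, double *y) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= nrows) return;
    double sum = 0.0;
    for (int k = rowptr[row]; k < rowptr[row + 1]; ++k)
        sum += static_cast<double>(vals[k]) * x[cols[k]];
    y[row] += sum;  // both precision parts add into the same result vector
}

// Hypothetical helper: copy a host vector to a fresh device buffer.
template <typename T>
static T *upload(const std::vector<T> &v) {
    T *d = nullptr;
    cudaMalloc(&d, v.size() * sizeof(T));
    cudaMemcpy(d, v.data(), v.size() * sizeof(T), cudaMemcpyHostToDevice);
    return d;
}

int main() {
    // Toy 4x4 CSR matrix. Small-magnitude entries go to the float part, the
    // rest stay in double -- a crude per-entry stand-in for BDMP's per-block,
    // value-distribution-based precision selection.
    int n = 4;
    std::vector<int> rowptr = {0, 2, 4, 6, 8};
    std::vector<int> cols   = {0, 1, 1, 2, 2, 3, 0, 3};
    std::vector<double> vals = {1e-3, 4.0, 2e-4, 5.0, 3e-3, 6.0, 7.0, 8e-4};
    std::vector<double> x = {1.0, 2.0, 3.0, 4.0};
    const double threshold = 1e-2;  // illustrative cutoff, not from the paper

    // Split into two independent CSR matrices by precision.
    std::vector<int> rp_f(n + 1, 0), rp_d(n + 1, 0), c_f, c_d;
    std::vector<float> v_f;
    std::vector<double> v_d;
    for (int r = 0; r < n; ++r) {
        for (int k = rowptr[r]; k < rowptr[r + 1]; ++k) {
            if (std::fabs(vals[k]) < threshold) {
                c_f.push_back(cols[k]); v_f.push_back((float)vals[k]);
            } else {
                c_d.push_back(cols[k]); v_d.push_back(vals[k]);
            }
        }
        rp_f[r + 1] = (int)c_f.size();
        rp_d[r + 1] = (int)c_d.size();
    }

    int *d_rpf = upload(rp_f), *d_cf = upload(c_f);
    int *d_rpd = upload(rp_d), *d_cd = upload(c_d);
    float  *d_vf = upload(v_f);
    double *d_vd = upload(v_d);
    double *d_x  = upload(x);
    double *d_y  = nullptr;
    cudaMalloc(&d_y, n * sizeof(double));
    cudaMemset(d_y, 0, n * sizeof(double));  // all-zero bytes == 0.0

    int threads = 128, blocks = (n + threads - 1) / threads;
    // Both launches use the default stream, so they run back to back and do
    // not race on y: first the double part, then the float part.
    csr_spmv<double><<<blocks, threads>>>(n, d_rpd, d_cd, d_vd, d_x, d_y);
    csr_spmv<float><<<blocks, threads>>>(n, d_rpf, d_cf, d_vf, d_x, d_y);

    std::vector<double> y(n);
    cudaMemcpy(y.data(), d_y, n * sizeof(double), cudaMemcpyDeviceToHost);
    for (int r = 0; r < n; ++r) printf("y[%d] = %.6f\n", r, y[r]);
    return 0;
}

The two-kernel split above mirrors only the "two independent matrices" structure of the PBP variant; BDMP-TCKI, by contrast, would additionally tailor a compressed storage format and kernel implementation to each block.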
Pages: 13681-13713
Page count: 33
Related Papers
50 in total
  • [31] Hierarchical Matrix Operations on GPUs: Matrix-Vector Multiplication and Compression
    Boukaram, Wajih
    Turkiyyah, George
    Keyes, David
    [J]. ACM TRANSACTIONS ON MATHEMATICAL SOFTWARE, 2019, 45 (01)
  • [32] Performance Analysis of Sparse Matrix-Vector Multiplication (SpMV) on Graphics Processing Units (GPUs)
    AlAhmadi, Sarah
    Mohammed, Thaha
    Albeshri, Aiiad
    Katib, Iyad
    Mehmood, Rashid
    [J]. ELECTRONICS, 2020, 9 (10): 1-30
  • [33] GPU accelerated sparse matrix-vector multiplication and sparse matrix-transpose vector multiplication
    Tao, Yuan
    Deng, Yangdong
    Mu, Shuai
    Zhang, Zhenzhong
    Zhu, Mingfa
    Xiao, Limin
    Ruan, Li
    [J]. CONCURRENCY AND COMPUTATION-PRACTICE & EXPERIENCE, 2015, 27 (14): 3771-3789
  • [34] Vector ISA extension for sparse matrix-vector multiplication
    Vassiliadis, S
    Cotofana, S
    Stathis, P
    [J]. EURO-PAR'99: PARALLEL PROCESSING, 1999, 1685: 708-715
  • [35] Node aware sparse matrix-vector multiplication
    Bienz, Amanda
    Gropp, William D.
    Olson, Luke N.
    [J]. JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING, 2019, 130: 166-178
  • [36] Sparse Matrix-Vector Multiplication on a Reconfigurable Supercomputer
    DuBois, David
    DuBois, Andrew
    Connor, Carolyn
    Poole, Steve
    [J]. PROCEEDINGS OF THE SIXTEENTH IEEE SYMPOSIUM ON FIELD-PROGRAMMABLE CUSTOM COMPUTING MACHINES, 2008: 239+
  • [37] Understanding the performance of sparse matrix-vector multiplication
    Goumas, Georgios
    Kourtis, Kornilios
    Anastopoulos, Nikos
    Karakasis, Vasileios
    Koziris, Nectarios
    [J]. PROCEEDINGS OF THE 16TH EUROMICRO CONFERENCE ON PARALLEL, DISTRIBUTED AND NETWORK-BASED PROCESSING, 2008: 283+
  • [38] Sparse matrix-vector multiplication design on FPGAs
    Sun, Junqing
    Peterson, Gregory
    Storaasli, Olaf
    [J]. FCCM 2007: 15TH ANNUAL IEEE SYMPOSIUM ON FIELD-PROGRAMMABLE CUSTOM COMPUTING MACHINES, PROCEEDINGS, 2007: 349+
  • [39] STRUCTURED SPARSE MATRIX-VECTOR MULTIPLICATION ON A MASPAR
    Dehn, T.
    Eiermann, M.
    Giebermann, K.
    Sperling, V.
    [J]. ZEITSCHRIFT FUR ANGEWANDTE MATHEMATIK UND MECHANIK, 1994, 74 (06): T534-T538
  • [40] Performance Aspects of Sparse Matrix-Vector Multiplication
    Simecek, I.
    [J]. ACTA POLYTECHNICA, 2006, 46 (03): 3-8