Block-wise dynamic mixed-precision for sparse matrix-vector multiplication on GPUs

Times cited: 0
Authors
Zhao, Zhixiang [1 ]
Zhang, Guoyin [1 ]
Wu, Yanxia [1 ]
Hong, Ruize [1 ]
Yang, Yiqing [1 ]
Fu, Yan [1 ]
Affiliations
[1] Harbin Engn Univ, Coll Comp Sci & Technol, Harbin, Peoples R China
Source
THE JOURNAL OF SUPERCOMPUTING | 2024, Vol. 80, No. 10
Keywords
SpMV; GPU; Mixed-precision; Block-wise; Computing method; Optimization; Format
DOI
10.1007/s11227-024-05949-6
CLC number
TP3 [Computing technology, computer technology]
Discipline code
0812
Abstract
Sparse matrix-vector multiplication (SpMV) plays a critical role in a wide range of linear algebra computations, particularly in scientific and engineering disciplines. However, irregular memory access patterns, extensive memory usage, high bandwidth requirements, and underutilization of parallelism hinder the computational efficiency of SpMV on GPUs. In this paper, we propose a novel approach called block-wise dynamic mixed-precision (BDMP) to address these challenges. Our methodology partitions the original matrix into uniformly sized blocks, with each block's size determined by architectural characteristics and accuracy requirements. We then dynamically assign a precision to each block using a selection method that accounts for the value distribution of the original sparse matrix. We develop two distinct SpMV computation algorithms for BDMP: BDMP-PBP (precision-based partitioning) and BDMP-TCKI (tailored compression and kernel implementation). BDMP-PBP partitions the matrix into two independent matrices, computed separately according to block precision, which offers flexibility for integration with other optimization techniques. Meanwhile, BDMP-TCKI achieves high thread-level parallelism and memory utilization by tailoring an appropriate compressed storage format and kernel implementation to each block. We compare BDMP against NVIDIA's cuSPARSE library and three state-of-the-art SpMV methods, SELLP, MergeBase, and BalanceCSR, using matrices from the University of Florida SuiteSparse collection.
BDMP-PBP and BDMP-TCKI show average speedups of up to 2.64× and 2.91× on a Turing RTX 2080Ti, and up to 2.99× and 3.22× on an Ampere A100. The results demonstrate that BDMP optimizes computation speed without compromising the precision necessary for reliable results.
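The two-operand design described for BDMP-PBP can be illustrated with a minimal sketch. This is a hypothetical illustration, not the paper's implementation: the function names, the magnitude-spread heuristic for choosing a block's precision, and the use of dense NumPy arrays in place of the tailored compressed GPU formats are all assumptions made for clarity.

```python
import numpy as np

def split_by_block_precision(A, block_size=2, spread_threshold=1e3):
    """Partition A into block_size x block_size tiles and route each tile to a
    float32 or float64 operand, based on the spread of its nonzero magnitudes
    (a hypothetical stand-in for the paper's value-distribution criterion)."""
    A32 = np.zeros_like(A, dtype=np.float64)
    A64 = np.zeros_like(A, dtype=np.float64)
    n_rows, n_cols = A.shape
    for i in range(0, n_rows, block_size):
        for j in range(0, n_cols, block_size):
            blk = A[i:i + block_size, j:j + block_size]
            nz = blk[blk != 0]
            if nz.size == 0:
                continue  # empty blocks belong to neither operand
            # Heuristic: a wide dynamic range suggests the block would lose
            # accuracy in single precision, so keep it in double.
            spread = np.abs(nz).max() / np.abs(nz).min()
            target = A64 if spread > spread_threshold else A32
            target[i:i + block_size, j:j + block_size] = blk
    return A32.astype(np.float32), A64  # two independent operands

def bdmp_spmv(A32, A64, x):
    """Compute y = A x as the sum of the two per-precision products,
    accumulating the low-precision part in double."""
    return A32.astype(np.float64) @ x + A64 @ x
```

In the paper's actual BDMP-TCKI variant, each operand would additionally get its own compressed storage format and GPU kernel; the split itself, as sketched here, is what allows the two products to be optimized independently.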
Pages: 13681-13713 (33 pages)