Block-wise dynamic mixed-precision for sparse matrix-vector multiplication on GPUs

Cited: 0
Authors
Zhao, Zhixiang [1 ]
Zhang, Guoyin [1 ]
Wu, Yanxia [1 ]
Hong, Ruize [1 ]
Yang, Yiqing [1 ]
Fu, Yan [1 ]
Affiliations
[1] Harbin Engineering University, College of Computer Science and Technology, Harbin, People's Republic of China
Source
JOURNAL OF SUPERCOMPUTING | 2024, Vol. 80, No. 10
Keywords
SpMV; GPU; Mixed-precision; Block-wise; Computing method; Optimization; Format
DOI
10.1007/s11227-024-05949-6
CLC Number
TP3 [Computing Technology, Computer Technology]
Discipline Code
0812
Abstract
Sparse matrix-vector multiplication (SpMV) plays a critical role in a wide range of linear algebra computations, particularly in scientific and engineering disciplines. However, irregular memory access patterns, extensive memory usage, high bandwidth requirements, and underutilized parallelism hinder the computational efficiency of SpMV on GPUs. In this paper, we propose a novel approach called block-wise dynamic mixed-precision (BDMP) to address these challenges. Our methodology partitions the original matrix into uniformly sized blocks, with each block's size determined by architectural characteristics and accuracy requirements. We then dynamically assign a precision to each block using a selection method that takes into account the value distribution of the original sparse matrix. We develop two distinct SpMV computation algorithms for BDMP: BDMP-PBP (precision-based partitioning) and BDMP-TCKI (tailored compression and kernel implementation). BDMP-PBP partitions the matrix into two independent matrices for separate computations based on block precision, offering flexibility for integration with other optimization techniques. BDMP-TCKI, in turn, pursues high thread-level parallelism and memory utilization by tailoring an appropriate compressed storage format and kernel implementation to each block. We compare BDMP against NVIDIA's cuSPARSE library and three state-of-the-art SpMV methods, SELLP, MergeBase, and BalanceCSR, using matrices from the University of Florida SuiteSparse collection. BDMP-PBP and BDMP-TCKI achieve average speedups of up to 2.64× and 2.91× on a Turing RTX 2080Ti, and up to 2.99× and 3.22× on an Ampere A100. The results demonstrate that BDMP optimizes computation speed without compromising the precision necessary for reliable results.
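
To make the precision-based partitioning (BDMP-PBP) idea concrete, the following is a minimal CPU-side sketch in Python/SciPy, not the paper's implementation: the matrix is split block-row by block-row into a float32 part and a float64 part, and the two SpMVs run independently and are summed in float64. The block size, the dynamic-range selection rule, and the names split_by_block_precision and bdmp_pbp_spmv are hypothetical stand-ins for the paper's architecture-aware block sizing and value-distribution criterion; on a GPU, the two resulting matrices would instead be handed to separate single- and double-precision kernels (e.g., via cuSPARSE).

import numpy as np
import scipy.sparse as sp

def split_by_block_precision(A, block_size=1024, ratio_threshold=1e4):
    # Partition the row dimension into uniform blocks and route each block
    # to float32 or float64 storage. The dynamic-range test below is a
    # hypothetical selection rule standing in for the paper's criterion.
    A = sp.csr_matrix(A)
    lo_parts, hi_parts = [], []
    for start in range(0, A.shape[0], block_size):
        block = A[start:start + block_size]
        empty = sp.csr_matrix(block.shape)
        vals = np.abs(block.data)
        if vals.size == 0:
            lo_parts.append(block)   # empty block: precision is irrelevant
            hi_parts.append(empty)
        elif vals.max() / max(vals.min(), np.finfo(np.float64).tiny) < ratio_threshold:
            lo_parts.append(block)   # narrow value range: float32 suffices
            hi_parts.append(empty)
        else:
            lo_parts.append(empty)   # wide value range: keep float64
            hi_parts.append(block)
    A_lo = sp.vstack(lo_parts).tocsr().astype(np.float32)
    A_hi = sp.vstack(hi_parts).tocsr()  # stays float64
    return A_lo, A_hi

def bdmp_pbp_spmv(A_lo, A_hi, x):
    # Two independent SpMVs, one per precision; partial results are
    # accumulated in float64, mirroring PBP's separate computations.
    y_lo = A_lo @ x.astype(np.float32)
    y_hi = A_hi @ x
    return y_lo.astype(np.float64) + y_hi

# Usage on a random test matrix: compare against a full float64 SpMV.
rng = np.random.default_rng(0)
A = sp.random(8192, 8192, density=1e-3, random_state=rng, format="csr")
x = rng.standard_normal(A.shape[1])
A_lo, A_hi = split_by_block_precision(A)
print(np.max(np.abs(bdmp_pbp_spmv(A_lo, A_hi, x) - A @ x)))

Because the two partitions are ordinary sparse matrices, this split composes naturally with other SpMV optimizations applied to each part separately, which is the flexibility the abstract attributes to BDMP-PBP.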
Pages: 13681-13713
Page count: 33
Related Papers
50 records in total
  • [1] Multiple-precision sparse matrix-vector multiplication on GPUs
    Isupov, Konstantin
    JOURNAL OF COMPUTATIONAL SCIENCE, 2022, 61
  • [2] Optimization techniques for sparse matrix-vector multiplication on GPUs
    Maggioni, Marco
    Berger-Wolf, Tanya
    JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING, 2016, 93-94 : 66 - 86
  • [3] Implementing Blocked Sparse Matrix-Vector Multiplication on NVIDIA GPUs
    Monakov, Alexander
    Avetisyan, Arutyun
    EMBEDDED COMPUTER SYSTEMS: ARCHITECTURES, MODELING, AND SIMULATION, PROCEEDINGS, 2009, 5657 : 289 - 297
  • [4] Scaleable Sparse Matrix-Vector Multiplication with Functional Memory and GPUs
    Tanabe, Noboru
    Ogawa, Yuuka
    Takata, Masami
    Joe, Kazuki
    PROCEEDINGS OF THE 19TH INTERNATIONAL EUROMICRO CONFERENCE ON PARALLEL, DISTRIBUTED, AND NETWORK-BASED PROCESSING, 2011, : 101 - 108
  • [5] Optimization of Sparse Matrix-Vector Multiplication with Variant CSR on GPUs
    Feng, Xiaowen
    Jin, Hai
    Zheng, Ran
    Hu, Kan
    Zeng, Jingxiang
    Shao, Zhiyuan
    2011 IEEE 17TH INTERNATIONAL CONFERENCE ON PARALLEL AND DISTRIBUTED SYSTEMS (ICPADS), 2011, : 165 - 172
  • [6] Dense and Sparse Matrix-Vector Multiplication on Maxwell GPUs with PyCUDA
    Nurudin Alvarez, Francisco
    Antonio Ortega-Toro, Jose
    Ujaldon, Manuel
    HIGH PERFORMANCE COMPUTING CARLA 2016, 2017, 697 : 219 - 229
  • [7] Fast Sparse Matrix-Vector Multiplication on GPUs for Graph Applications
    Ashari, Arash
    Sedaghati, Naser
    Eisenlohr, John
    Parthasarathy, Srinivasan
    Sadayappan, P.
    SC14: INTERNATIONAL CONFERENCE FOR HIGH PERFORMANCE COMPUTING, NETWORKING, STORAGE AND ANALYSIS, 2014, : 781 - 792
  • [8] Characterizing Dataset Dependence for Sparse Matrix-Vector Multiplication on GPUs
    Sedaghati, Naser
    Ashari, Arash
    Pouchet, Louis-Noel
    Parthasarathy, Srinivasan
    Sadayappan, P.
    2ND WORKSHOP ON PARALLEL PROGRAMMING FOR ANALYTICS APPLICATIONS (PPAA 2015), 2015, : 17 - 24
  • [9] Iterative Sparse Matrix-Vector Multiplication for Integer Factorization on GPUs
    Schmidt, Bertil
    Aribowo, Hans
    Dang, Hoang-Vu
    EURO-PAR 2011 PARALLEL PROCESSING, PT 2, 2011, 6853 : 413 - 424
  • [10] TileSpMV: A Tiled Algorithm for Sparse Matrix-Vector Multiplication on GPUs
    Niu, Yuyao
    Lu, Zhengyang
    Dong, Meichen
    Jin, Zhou
    Liu, Weifeng
    Tan, Guangming
    2021 IEEE 35TH INTERNATIONAL PARALLEL AND DISTRIBUTED PROCESSING SYMPOSIUM (IPDPS), 2021, : 68 - 78