Block-wise dynamic mixed-precision for sparse matrix-vector multiplication on GPUs

Cited: 0
|
Authors
Zhao, Zhixiang [1 ]
Zhang, Guoyin [1 ]
Wu, Yanxia [1 ]
Hong, Ruize [1 ]
Yang, Yiqing [1 ]
Fu, Yan [1 ]
Affiliations
[1] Harbin Engn Univ, Coll Comp Sci & Technol, Harbin, Peoples R China
Source
JOURNAL OF SUPERCOMPUTING | 2024, Vol. 80, No. 10
Keywords
SpMV; GPU; Mixed-precision; Block-wise; COMPUTING METHOD; SPMV; OPTIMIZATION; FORMAT
DOI
10.1007/s11227-024-05949-6
CLC Classification Number
TP3 [Computing Technology and Computer Technology]
Discipline Code
0812
Abstract
Sparse matrix-vector multiplication (SpMV) plays a critical role in a wide range of linear algebra computations, particularly in scientific and engineering disciplines. However, the irregular memory access patterns, extensive memory usage, high bandwidth requirements, and underutilization of parallelism hinder the computational efficiency of SpMV on GPUs. In this paper, we propose a novel approach called block-wise dynamic mixed-precision (BDMP) to address these challenges. Our methodology involves partitioning the original matrix into uniformly sized blocks, with each block's size determined by considering architectural characteristics and accuracy requirements. Additionally, we dynamically assign precision to each block using a precision selection method that takes into account the value distribution of the original sparse matrix. We develop two distinct SpMV computation algorithms for BDMP: BDMP-PBP (Precision-based partitioning) and BDMP-TCKI (Tailored compression and kernel implementation). BDMP-PBP partitions the matrix into two independent matrices for separate computations based on block precision, offering flexibility for integration with other optimization techniques. Meanwhile, BDMP-TCKI focuses on achieving significant thread-level parallelism and memory utilization by tailoring an appropriate compressed storage format and kernel implementation for each block. We compare BDMP with NVIDIA's cuSPARSE library and three state-of-the-art SpMV methods, including SELLP, MergeBase, and BalanceCSR, using matrices from the University of Florida's SuiteSparse dataset collection. BDMP-PBP and BDMP-TCKI show average speedups up to 2.64× and 2.91× on Turing RTX 2080Ti, and up to 2.99× and 3.22× on Ampere A100. The results demonstrate that BDMP enables the optimization of computation speed without compromising the precision necessary for reliable results.
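The following is a minimal CUDA sketch of the precision-based partitioning idea described for BDMP-PBP: the matrix is split into two independent CSR matrices, one stored in single precision and one in double precision, each multiplied by its own kernel, with the partial results summed into one output vector. The per-entry magnitude threshold, the helper names, and the scalar one-thread-per-row kernel are illustrative assumptions, not the authors' block-wise, value-distribution-based selection method or their tuned kernels.

// bdmp_pbp_sketch.cu -- illustrative sketch only; build with: nvcc bdmp_pbp_sketch.cu
#include <cstdio>
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

// Scalar CSR SpMV, one thread per row; T selects the storage precision of the
// matrix values. Partial products accumulate into y in double precision.
template <typename T>
__global__ void csr_spmv(int nrows, const int *rowptr, const int *cols,
                         const T *vals, const double *x, double *y) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= nrows) return;
    double sum = 0.0;
    for (int k = rowptr[row]; k < rowptr[row + 1]; ++k)
        sum += static_cast<double>(vals[k]) * x[cols[k]];
    y[row] += sum;  // both precision parts add into the same result vector
}

// Hypothetical helper: copy a host vector to a fresh device buffer.
template <typename T>
static T *upload(const std::vector<T> &v) {
    T *d = nullptr;
    cudaMalloc(&d, v.size() * sizeof(T));
    cudaMemcpy(d, v.data(), v.size() * sizeof(T), cudaMemcpyHostToDevice);
    return d;
}

int main() {
    // Toy 4x4 CSR matrix. Small-magnitude entries go to the float part, the
    // rest stay in double -- a crude per-entry stand-in for BDMP's per-block,
    // value-distribution-based precision selection.
    int n = 4;
    std::vector<int> rowptr = {0, 2, 4, 6, 8};
    std::vector<int> cols   = {0, 1, 1, 2, 2, 3, 0, 3};
    std::vector<double> vals = {1e-3, 4.0, 2e-4, 5.0, 3e-3, 6.0, 7.0, 8e-4};
    std::vector<double> x = {1.0, 2.0, 3.0, 4.0};
    const double threshold = 1e-2;  // illustrative cutoff, not from the paper

    // Split into two independent CSR matrices by precision.
    std::vector<int> rp_f(n + 1, 0), rp_d(n + 1, 0), c_f, c_d;
    std::vector<float> v_f;
    std::vector<double> v_d;
    for (int r = 0; r < n; ++r) {
        for (int k = rowptr[r]; k < rowptr[r + 1]; ++k) {
            if (std::fabs(vals[k]) < threshold) {
                c_f.push_back(cols[k]); v_f.push_back((float)vals[k]);
            } else {
                c_d.push_back(cols[k]); v_d.push_back(vals[k]);
            }
        }
        rp_f[r + 1] = (int)c_f.size();
        rp_d[r + 1] = (int)c_d.size();
    }

    int *d_rpf = upload(rp_f), *d_cf = upload(c_f);
    int *d_rpd = upload(rp_d), *d_cd = upload(c_d);
    float  *d_vf = upload(v_f);
    double *d_vd = upload(v_d);
    double *d_x  = upload(x);
    double *d_y  = nullptr;
    cudaMalloc(&d_y, n * sizeof(double));
    cudaMemset(d_y, 0, n * sizeof(double));  // all-zero bytes == 0.0

    int threads = 128, blocks = (n + threads - 1) / threads;
    // Both launches use the default stream, so they run back to back and do
    // not race on y: first the double part, then the float part.
    csr_spmv<double><<<blocks, threads>>>(n, d_rpd, d_cd, d_vd, d_x, d_y);
    csr_spmv<float><<<blocks, threads>>>(n, d_rpf, d_cf, d_vf, d_x, d_y);

    std::vector<double> y(n);
    cudaMemcpy(y.data(), d_y, n * sizeof(double), cudaMemcpyDeviceToHost);
    for (int r = 0; r < n; ++r) printf("y[%d] = %.6f\n", r, y[r]);
    return 0;
}

The two-kernel split above mirrors only the "two independent matrices" structure of the PBP variant; BDMP-TCKI, by contrast, would additionally tailor a compressed storage format and kernel implementation to each block.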
Pages: 13681-13713
Page count: 33
Related Papers
50 in total
  • [31] Hierarchical Matrix Operations on GPUs: Matrix-Vector Multiplication and Compression
    Boukaram, Wajih
    Turkiyyah, George
    Keyes, David
    [J]. ACM TRANSACTIONS ON MATHEMATICAL SOFTWARE, 2019, 45 (01)
  • [32] Performance Analysis of Sparse Matrix-Vector Multiplication (SpMV) on Graphics Processing Units (GPUs)
    AlAhmadi, Sarah
    Mohammed, Thaha
    Albeshri, Aiiad
    Katib, Iyad
    Mehmood, Rashid
    [J]. ELECTRONICS, 2020, 9 (10): 1-30
  • [33] GPU accelerated sparse matrix-vector multiplication and sparse matrix-transpose vector multiplication
    Tao, Yuan
    Deng, Yangdong
    Mu, Shuai
    Zhang, Zhenzhong
    Zhu, Mingfa
    Xiao, Limin
    Ruan, Li
    [J]. CONCURRENCY AND COMPUTATION-PRACTICE & EXPERIENCE, 2015, 27 (14): 3771-3789
  • [34] Vector ISA extension for sparse matrix-vector multiplication
    Vassiliadis, S
    Cotofana, S
    Stathis, P
    [J]. EURO-PAR'99: PARALLEL PROCESSING, 1999, 1685: 708-715
  • [35] Node aware sparse matrix-vector multiplication
    Bienz, Amanda
    Gropp, William D.
    Olson, Luke N.
    [J]. JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING, 2019, 130: 166-178
  • [36] Sparse Matrix-Vector Multiplication on a Reconfigurable Supercomputer
    DuBois, David
    DuBois, Andrew
    Connor, Carolyn
    Poole, Steve
    [J]. PROCEEDINGS OF THE SIXTEENTH IEEE SYMPOSIUM ON FIELD-PROGRAMMABLE CUSTOM COMPUTING MACHINES, 2008: 239+
  • [37] Understanding the performance of sparse matrix-vector multiplication
    Goumas, Georgios
    Kourtis, Kornilios
    Anastopoulos, Nikos
    Karakasis, Vasileios
    Koziris, Nectarios
    [J]. PROCEEDINGS OF THE 16TH EUROMICRO CONFERENCE ON PARALLEL, DISTRIBUTED AND NETWORK-BASED PROCESSING, 2008: 283+
  • [38] Sparse matrix-vector multiplication design on FPGAs
    Sun, Junqing
    Peterson, Gregory
    Storaasli, Olaf
    [J]. FCCM 2007: 15TH ANNUAL IEEE SYMPOSIUM ON FIELD-PROGRAMMABLE CUSTOM COMPUTING MACHINES, PROCEEDINGS, 2007: 349+
  • [39] STRUCTURED SPARSE MATRIX-VECTOR MULTIPLICATION ON A MASPAR
    Dehn, T.
    Eiermann, M.
    Giebermann, K.
    Sperling, V.
    [J]. ZEITSCHRIFT FUR ANGEWANDTE MATHEMATIK UND MECHANIK, 1994, 74 (06): T534-T538
  • [40] Performance Aspects of Sparse Matrix-Vector Multiplication
    Simecek, I.
    [J]. ACTA POLYTECHNICA, 2006, 46 (03): 3-8