A high-performance batched matrix multiplication framework for GPUs under unbalanced input distribution

Citations: 0
Authors
Ruimin Wang
Zhiwei Yang
Hao Xu
Lu Lu
Affiliations
[1] South China University of Technology, School of Computer Science and Engineering
Source
Journal of Supercomputing, 2022, 78(2)
Keywords
Batched GEMM; HPC; GPU; BLAS
Abstract
In the past few decades, general matrix multiplication (GEMM), the core component of the Basic Linear Algebra Subprograms (BLAS) library, has played a vital role in fields such as machine learning, image processing, and fluid dynamics. Because these fields tend to decompose a problem into many smaller sub-problems, today’s BLAS libraries provide batched GEMM routines to achieve high performance in this scenario. MAGMA offers a vbatch routine that computes variable-size batched GEMM on GPUs, but unbalanced input leaves some workgroups and threads idle, degrading performance. Unbalanced input also upsets the load balance across the GPU’s Compute Units, and extreme inputs leave hardware resources underutilized. In this paper, we propose a high-performance batched GEMM computing framework for GPUs. For large batches of small matrices with variable sizes and an unbalanced distribution, the framework takes the hardware architecture and the likely data distributions into account and applies three methods (flexible tile, sort-up, and split-down) to improve hardware utilization and achieve better load balancing. Experimental results show that our framework achieves a 3.02× speedup over the latest MAGMA implementation on an AMD Radeon Instinct MI50 GPU and a 3.14× speedup on an MI100.
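The abstract names three techniques, flexible tile, sort-up, and split-down, without detailing them. As a rough illustration of how the latter two could plausibly work, the sketch below splits oversized matrices into tile-sized work items and sorts the resulting items by work volume before launch; all identifiers (GemmProblem, TILE_M, schedule, ...) are hypothetical and not taken from the paper.

```python
# A minimal sketch (not the authors' code) of the "sort-up" and "split-down"
# ideas from the abstract, for a batch of variable-size GEMM problems
# C_i = A_i * B_i described by their (m, n, k) shapes.
from dataclasses import dataclass

TILE_M, TILE_N = 64, 64  # assumed output-tile shape handled by one workgroup


@dataclass
class GemmProblem:
    m: int          # rows of C_i
    n: int          # columns of C_i
    k: int          # inner dimension
    batch_id: int   # index of the original matrix in the batch
    row0: int = 0   # row offset of this piece inside C_i
    col0: int = 0   # column offset of this piece inside C_i


def split_down(p: GemmProblem) -> list[GemmProblem]:
    """Split an oversized problem into tile-sized pieces so that one large
    matrix cannot monopolize a Compute Unit while others sit idle."""
    return [
        GemmProblem(min(TILE_M, p.m - r), min(TILE_N, p.n - c), p.k,
                    p.batch_id, p.row0 + r, p.col0 + c)
        for r in range(0, p.m, TILE_M)
        for c in range(0, p.n, TILE_N)
    ]


def schedule(batch: list[GemmProblem]) -> list[GemmProblem]:
    """Build the work list: split-down first, then sort-up by work volume
    so that workgroups launched together carry similar loads."""
    work = [piece for p in batch for piece in split_down(p)]
    work.sort(key=lambda w: w.m * w.n * w.k, reverse=True)
    return work


if __name__ == "__main__":
    # An unbalanced batch: one 500x500 matrix next to seven tiny ones.
    batch = [GemmProblem(500, 500, 64, 0)] + \
            [GemmProblem(16, 16, 16, i) for i in range(1, 8)]
    print(len(schedule(batch)), "work items")  # 64 pieces + 7 small GEMMs
```

A real implementation would map each work item to a GPU workgroup; the point of the ordering is that workgroups launched together receive similar amounts of work, which is the load-balancing effect the abstract describes.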
Pages: 1741-1758
Page count: 17
Related Papers
50 records in total
  • [1] A high-performance batched matrix multiplication framework for GPUs under unbalanced input distribution
    Wang, Ruimin
    Yang, Zhiwei
    Xu, Hao
    Lu, Lu
    [J]. JOURNAL OF SUPERCOMPUTING, 2022, 78(2): 1741-1758
  • [2] TSM2X: High-performance tall-and-skinny matrix-matrix multiplication on GPUs
    Rivera, Cody
    Chen, Jieyang
    Xiong, Nan
    Zhang, Jing
    Song, Shuaiwen Leon
    Tao, Dingwen
    [J]. JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING, 2021, 151: 70-85
  • [3] Anatomy of high-performance matrix multiplication
    Goto, Kazushige
    Van De Geijn, Robert A.
    [J]. ACM TRANSACTIONS ON MATHEMATICAL SOFTWARE, 2008, 34(3)
  • [4] High-Performance Homomorphic Matrix Completion on Multiple GPUs
    Zhang, Tao
    Lu, Han
    Liu, Xiao-Yang
    [J]. IEEE ACCESS, 2020, 8: 25395-25406
  • [5] Fast Batched Matrix Multiplication for Small Sizes using Half-Precision Arithmetic on GPUs
    Abdelfattah, Ahmad
    Tomov, Stanimire
    Dongarra, Jack
    [J]. 2019 IEEE 33RD INTERNATIONAL PARALLEL AND DISTRIBUTED PROCESSING SYMPOSIUM (IPDPS 2019), 2019: 111-122
  • [6] A family of high-performance matrix multiplication algorithms
    Gunnels, JA
    Gustavson, FG
    Henry, GM
    van de Geijn, RA
    [J]. APPLIED PARALLEL COMPUTING: STATE OF THE ART IN SCIENTIFIC COMPUTING, 2006, 3732: 256-265
  • [7] Matrix Converter Performance Under Unbalanced Input-Voltage
    Lozano, Jose M.
    Ramirez, Juan M.
    [J]. 2008 40TH NORTH AMERICAN POWER SYMPOSIUM (NAPS 2008), 2008: 500+
  • [8] Unleashing the performance of bmSparse for the sparse matrix multiplication in GPUs
    Berger, Gonzalo
    Freire, Manuel
    Marini, Renzo
    Dufrechou, Ernesto
    Ezzatti, Pablo
    [J]. PROCEEDINGS OF SCALA 2021: 12TH WORKSHOP ON LATEST ADVANCES IN SCALABLE ALGORITHMS FOR LARGE-SCALE SYSTEMS, 2021: 19-26
  • [9] A framework for general sparse matrix-matrix multiplication on GPUs and heterogeneous processors
    Liu, Weifeng
    Vinter, Brian
    [J]. JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING, 2015, 85: 47-61
  • [10] High-Performance Matrix-Vector Multiplication on the GPU
    Sorensen, Hans Henrik Brandenborg
    [J]. EURO-PAR 2011: PARALLEL PROCESSING WORKSHOPS, PT I, 2012, 7155: 377-386