A high-performance batched matrix multiplication framework for GPUs under unbalanced input distribution

Citations: 0
Authors
Ruimin Wang
Zhiwei Yang
Hao Xu
Lu Lu
Affiliations
[1] South China University of Technology, School of Computer Science and Engineering
Source
Journal of Supercomputing, 2022, 78(2)
Keywords
Batched GEMM; HPC; GPU; BLAS
Abstract
In the past few decades, general matrix multiplication (GEMM), the core component of the Basic Linear Algebra Subprograms (BLAS) library, has played a vital role in fields such as machine learning, image processing, and fluid dynamics. Because these fields tend to decompose a problem into many smaller sub-problems, today’s BLAS libraries provide batched GEMM routines to achieve high performance in this scenario. MAGMA offers a vbatch routine that computes variable-size batched GEMM on GPUs, but unbalanced input leaves some workgroups and threads idle, degrading performance. Unbalanced input also upsets the load balance across the GPU’s Compute Units, and extreme inputs leave hardware resources underutilized. In this paper, we propose a high-performance batched GEMM computing framework for GPUs. For large batches of small matrices with variable sizes and an unbalanced distribution, the framework takes the hardware architecture and the likely data distributions into account and applies three methods (flexible tile, sort-up, and split-down) to improve hardware utilization and achieve better load balancing. Experimental results show that our framework achieves a 3.02× speedup over the latest MAGMA implementation on an AMD Radeon Instinct MI50 GPU and a 3.14× speedup on an MI100.
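The abstract names three techniques, flexible tile, sort-up, and split-down, without detailing them. As a rough illustration of how the latter two could plausibly work, the sketch below splits oversized matrices into tile-sized work items and sorts the resulting items by work volume before launch; all identifiers (GemmProblem, TILE_M, schedule, ...) are hypothetical and not taken from the paper.

```python
# A minimal sketch (not the authors' code) of the "sort-up" and "split-down"
# ideas from the abstract, for a batch of variable-size GEMM problems
# C_i = A_i * B_i described by their (m, n, k) shapes.
from dataclasses import dataclass

TILE_M, TILE_N = 64, 64  # assumed output-tile shape handled by one workgroup


@dataclass
class GemmProblem:
    m: int          # rows of C_i
    n: int          # columns of C_i
    k: int          # inner dimension
    batch_id: int   # index of the original matrix in the batch
    row0: int = 0   # row offset of this piece inside C_i
    col0: int = 0   # column offset of this piece inside C_i


def split_down(p: GemmProblem) -> list[GemmProblem]:
    """Split an oversized problem into tile-sized pieces so that one large
    matrix cannot monopolize a Compute Unit while others sit idle."""
    return [
        GemmProblem(min(TILE_M, p.m - r), min(TILE_N, p.n - c), p.k,
                    p.batch_id, p.row0 + r, p.col0 + c)
        for r in range(0, p.m, TILE_M)
        for c in range(0, p.n, TILE_N)
    ]


def schedule(batch: list[GemmProblem]) -> list[GemmProblem]:
    """Build the work list: split-down first, then sort-up by work volume
    so that workgroups launched together carry similar loads."""
    work = [piece for p in batch for piece in split_down(p)]
    work.sort(key=lambda w: w.m * w.n * w.k, reverse=True)
    return work


if __name__ == "__main__":
    # An unbalanced batch: one 500x500 matrix next to seven tiny ones.
    batch = [GemmProblem(500, 500, 64, 0)] + \
            [GemmProblem(16, 16, 16, i) for i in range(1, 8)]
    print(len(schedule(batch)), "work items")  # 64 pieces + 7 small GEMMs
```

A real implementation would map each work item to a GPU workgroup; the point of the ordering is that workgroups launched together receive similar amounts of work, which is the load-balancing effect the abstract describes.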
Pages: 1741-1758
Page count: 17
Related Papers
50 records in total
  • [1] A high-performance batched matrix multiplication framework for GPUs under unbalanced input distribution
    Wang, Ruimin
    Yang, Zhiwei
    Xu, Hao
    Lu, Lu
    [J]. JOURNAL OF SUPERCOMPUTING, 2022, 78(2): 1741-1758
  • [2] TSM2X: High-performance tall-and-skinny matrix-matrix multiplication on GPUs
    Rivera, Cody
    Chen, Jieyang
    Xiong, Nan
    Zhang, Jing
    Song, Shuaiwen Leon
    Tao, Dingwen
    [J]. JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING, 2021, 151: 70-85
  • [3] Anatomy of high-performance matrix multiplication
    Goto, Kazushige
    Van De Geijn, Robert A.
    [J]. ACM TRANSACTIONS ON MATHEMATICAL SOFTWARE, 2008, 34(3)
  • [4] High-Performance Homomorphic Matrix Completion on Multiple GPUs
    Zhang, Tao
    Lu, Han
    Liu, Xiao-Yang
    [J]. IEEE ACCESS, 2020, 8: 25395-25406
  • [5] Fast Batched Matrix Multiplication for Small Sizes using Half-Precision Arithmetic on GPUs
    Abdelfattah, Ahmad
    Tomov, Stanimire
    Dongarra, Jack
    [J]. 2019 IEEE 33RD INTERNATIONAL PARALLEL AND DISTRIBUTED PROCESSING SYMPOSIUM (IPDPS 2019), 2019: 111-122
  • [6] A family of high-performance matrix multiplication algorithms
    Gunnels, JA
    Gustavson, FG
    Henry, GM
    van de Geijn, RA
    [J]. APPLIED PARALLEL COMPUTING: STATE OF THE ART IN SCIENTIFIC COMPUTING, 2006, 3732: 256-265
  • [7] Matrix Converter Performance Under Unbalanced Input-Voltage
    Lozano, Jose M.
    Ramirez, Juan M.
    [J]. 2008 40TH NORTH AMERICAN POWER SYMPOSIUM (NAPS 2008), 2008: 500+
  • [8] Unleashing the performance of bmSparse for the sparse matrix multiplication in GPUs
    Berger, Gonzalo
    Freire, Manuel
    Marini, Renzo
    Dufrechou, Ernesto
    Ezzatti, Pablo
    [J]. PROCEEDINGS OF SCALA 2021: 12TH WORKSHOP ON LATEST ADVANCES IN SCALABLE ALGORITHMS FOR LARGE-SCALE SYSTEMS, 2021: 19-26
  • [9] A framework for general sparse matrix-matrix multiplication on GPUs and heterogeneous processors
    Liu, Weifeng
    Vinter, Brian
    [J]. JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING, 2015, 85: 47-61
  • [10] High-Performance Matrix-Vector Multiplication on the GPU
    Sorensen, Hans Henrik Brandenborg
    [J]. EURO-PAR 2011: PARALLEL PROCESSING WORKSHOPS, PT I, 2012, 7155: 377-386