Fast Batched Matrix Multiplication for Small Sizes using Half-Precision Arithmetic on GPUs

Cited by: 24
Authors
Abdelfattah, Ahmad [1 ]
Tomov, Stanimire [1 ]
Dongarra, Jack [2 ,3 ,4 ]
Affiliations
[1] Univ Tennessee, Innovat Comp Lab, Knoxville, TN 37996 USA
[2] Univ Tennessee, Knoxville, TN 37996 USA
[3] Oak Ridge Natl Lab, Oak Ridge, TN USA
[4] Univ Manchester, Manchester, Lancs, England
Keywords
Matrix multiplication; batched linear algebra; FP16 arithmetic; GPU computing; linear algebra
DOI
10.1109/IPDPS.2019.00022
Chinese Library Classification (CLC)
TP3 [computing technology, computer technology]
Discipline Code
0812
Abstract
Matrix multiplication (GEMM) is the most important operation in dense linear algebra. Because it is a compute-bound operation that is rich in data reuse, many applications from different scientific domains cast their most performance-critical stages as GEMM calls. With the rise of batched linear algebra, batched GEMM operations have become increasingly popular in domains beyond dense linear solvers, such as tensor contractions, sparse direct solvers, and machine learning. For the latter in particular, batched GEMM in reduced precision (i.e., FP16) is the core operation of many deep learning frameworks. This paper introduces an optimized batched GEMM for FP16 arithmetic (HGEMM) on graphics processing units (GPUs). We provide a detailed design strategy that takes advantage of the Tensor Core technology recently introduced in CUDA-enabled GPUs. The developed solution uses the low-level APIs provided by the vendor in an optimized design that overcomes the limitations imposed by the hardware (in the form of discrete configurations). The outcome is a highly flexible GPU kernel that gives the developer a great deal of control despite those restrictions. The paper also pays particular attention to multiplications of very small matrices that cannot fully occupy the Tensor Core units. Our results show that the proposed design outperforms the highly optimized vendor routine for sizes up to 100 by factors between 1.2x and 10x on a Tesla V100 GPU. For extremely small matrices, the observed speedups range between 1.8x and 26x.
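The abstract refers to the vendor's low-level Tensor Core interface and its fixed ("discrete") fragment configurations. As a point of reference only, the sketch below shows the smallest such configuration through CUDA's public WMMA API: one warp multiplying one 16x16x16 FP16 matrix per batch entry. This is not the paper's optimized kernel; the kernel name, the pointer-array argument layout, and the one-warp-per-matrix launch are illustrative assumptions.

#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// Smallest WMMA fragment shape exposed by the API (one of the fixed
// "discrete configurations" mentioned in the abstract).
constexpr int M = 16, N = 16, K = 16;

// Hypothetical batched FP16 GEMM: each thread block holds a single warp
// and computes C[i] = A[i] * B[i] for one 16x16 batch entry i.
__global__ void hgemm_batched_16x16(const half* const* dA,
                                    const half* const* dB,
                                    half* const* dC)
{
    const int batch = blockIdx.x;  // one warp (block) per matrix in the batch

    wmma::fragment<wmma::matrix_a, M, N, K, half, wmma::col_major> a;
    wmma::fragment<wmma::matrix_b, M, N, K, half, wmma::col_major> b;
    wmma::fragment<wmma::accumulator, M, N, K, half> c;

    wmma::fill_fragment(c, __float2half(0.0f));           // C := 0
    wmma::load_matrix_sync(a, dA[batch], M);               // lda = 16
    wmma::load_matrix_sync(b, dB[batch], K);               // ldb = 16
    wmma::mma_sync(c, a, b, c);                            // Tensor Core multiply-accumulate
    wmma::store_matrix_sync(dC[batch], c, M, wmma::mem_col_major);
}

// Launch with one 32-thread block (a single warp) per batch entry:
//   hgemm_batched_16x16<<<batchCount, 32>>>(dA_array, dB_array, dC_array);

The paper's contribution lies precisely where this sketch falls short: matrices smaller than the 16x16x16 fragment leave the Tensor Cores underutilized, and that very-small-size regime is what the proposed kernel targets.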
Pages: 111-122
Page count: 12