Fast Batched Matrix Multiplication for Small Sizes using Half-Precision Arithmetic on GPUs

Cited by: 24
Authors
Abdelfattah, Ahmad [1 ]
Tomov, Stanimire [1 ]
Dongarra, Jack [2 ,3 ,4 ]
Affiliations
[1] Univ Tennessee, Innovat Comp Lab, Knoxville, TN 37996 USA
[2] Univ Tennessee, Knoxville, TN 37996 USA
[3] Oak Ridge Natl Lab, Oak Ridge, TN USA
[4] Univ Manchester, Manchester, Lancs, England
Keywords
Matrix multiplication; batched linear algebra; FP16 arithmetic; GPU computing; linear algebra
DOI
10.1109/IPDPS.2019.00022
Chinese Library Classification (CLC)
TP3 [computing technology, computer technology]
Discipline Code
0812
Abstract
Matrix multiplication (GEMM) is the most important operation in dense linear algebra. Because it is a compute-bound operation that is rich in data reuse, many applications from different scientific domains cast their most performance-critical stages as GEMM calls. With the rise of batched linear algebra, batched GEMM operations have become increasingly popular in domains beyond dense linear solvers, such as tensor contractions, sparse direct solvers, and machine learning. For the latter in particular, batched GEMM in reduced precision (i.e., FP16) is the core operation of many deep learning frameworks. This paper introduces an optimized batched GEMM for FP16 arithmetic (HGEMM) on graphics processing units (GPUs). We provide a detailed design strategy that takes advantage of the Tensor Core technology recently introduced in CUDA-enabled GPUs. The developed solution uses the low-level APIs provided by the vendor in an optimized design that overcomes the limitations imposed by the hardware (in the form of discrete configurations). The outcome is a highly flexible GPU kernel that gives the developer a great deal of control despite those restrictions. The paper also pays particular attention to multiplications of very small matrices that cannot fully occupy the Tensor Core units. Our results show that the proposed design outperforms the highly optimized vendor routine for sizes up to 100 by factors between 1.2x and 10x on a Tesla V100 GPU. For extremely small matrices, the observed speedups range between 1.8x and 26x.
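The abstract refers to the vendor's low-level Tensor Core interface and its fixed ("discrete") fragment configurations. As a point of reference only, the sketch below shows the smallest such configuration through CUDA's public WMMA API: one warp multiplying one 16x16x16 FP16 matrix per batch entry. This is not the paper's optimized kernel; the kernel name, the pointer-array argument layout, and the one-warp-per-matrix launch are illustrative assumptions.

#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// Smallest WMMA fragment shape exposed by the API (one of the fixed
// "discrete configurations" mentioned in the abstract).
constexpr int M = 16, N = 16, K = 16;

// Hypothetical batched FP16 GEMM: each thread block holds a single warp
// and computes C[i] = A[i] * B[i] for one 16x16 batch entry i.
__global__ void hgemm_batched_16x16(const half* const* dA,
                                    const half* const* dB,
                                    half* const* dC)
{
    const int batch = blockIdx.x;  // one warp (block) per matrix in the batch

    wmma::fragment<wmma::matrix_a, M, N, K, half, wmma::col_major> a;
    wmma::fragment<wmma::matrix_b, M, N, K, half, wmma::col_major> b;
    wmma::fragment<wmma::accumulator, M, N, K, half> c;

    wmma::fill_fragment(c, __float2half(0.0f));           // C := 0
    wmma::load_matrix_sync(a, dA[batch], M);               // lda = 16
    wmma::load_matrix_sync(b, dB[batch], K);               // ldb = 16
    wmma::mma_sync(c, a, b, c);                            // Tensor Core multiply-accumulate
    wmma::store_matrix_sync(dC[batch], c, M, wmma::mem_col_major);
}

// Launch with one 32-thread block (a single warp) per batch entry:
//   hgemm_batched_16x16<<<batchCount, 32>>>(dA_array, dB_array, dC_array);

The paper's contribution lies precisely where this sketch falls short: matrices smaller than the 16x16x16 fragment leave the Tensor Cores underutilized, and that very-small-size regime is what the proposed kernel targets.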
Pages: 111-122
Page count: 12