tcFFT: A Fast Half-Precision FFT Library for NVIDIA Tensor Cores

Cited by: 10

Authors:

Li, Binrui [1]

Cheng, Shenggan [1]

Lin, James [1]

Affiliations:

[1] Shanghai Jiao Tong University, Center for High Performance Computing, Shanghai, People's Republic of China

Source:

2021 IEEE International Conference on Cluster Computing (CLUSTER 2021) | 2021

Keywords:

Mixed-precision; FFT; GPU; Tensor Cores

DOI:

10.1109/Cluster48925.2021.00035

Chinese Library Classification (CLC) number:

TP3 [Computing technology, computer technology]

Discipline classification code:

0812

Abstract:

Mixed-precision computing has become an inevitable trend for HPC and AI applications due to the increasing use of mixed-precision units such as NVIDIA Tensor Cores. The fast Fourier transform (FFT) is one of the most widely used scientific kernels, and hence mixed-precision FFT is in high demand. However, few existing FFT libraries (or algorithms) support FFTs of general size on Tensor Cores. We therefore propose tcFFT, a fast half-precision FFT library for Tensor Cores that supports 1D and 2D FFTs of general size. Our work consists of two parts: framework design and performance optimization. We designed the tcFFT library framework to support all power-of-two sizes and multi-dimensional FFTs, and we applied two performance optimizations: one to use Tensor Cores efficiently and the other to ease GPU memory bottlenecks. We evaluated tcFFT on a wide range of 1D and 2D FFT sizes on NVIDIA V100 and A100 GPUs. The results show that tcFFT achieves average speedups of 1.29X-3.24X on V100 and 1.10X-3.03X on A100 over NVIDIA cuFFT v11.0 in FP16.
