tcFFT: A Fast Half-Precision FFT Library for NVIDIA Tensor Cores

Cited by: 10
Authors
Li, Binrui [1]
Cheng, Shenggan [1]
Lin, James [1]
Affiliations
[1] Shanghai Jiao Tong Univ, Ctr High Performance Comp, Shanghai, Peoples R China
Keywords
Mixed-precision; FFT; GPU; Tensor Cores
DOI
10.1109/Cluster48925.2021.00035
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology]
Subject classification code
0812
Abstract
Mixed-precision computing is becoming an inevitable trend for HPC and AI applications because of the increasing use of mixed-precision units such as NVIDIA Tensor Cores. The fast Fourier transform (FFT) is one of the most widely used scientific kernels, so mixed-precision FFT is in high demand. However, few existing FFT libraries (or algorithms) support FFTs of general size on Tensor Cores. We therefore propose tcFFT, a fast half-precision FFT library for Tensor Cores that supports 1D and 2D FFTs of general size. Our work consists of two parts: framework design and performance optimization. We designed the tcFFT library framework to support FFTs of all power-of-two sizes and of multiple dimensions, and we applied two performance optimizations: one to use Tensor Cores efficiently and the other to ease GPU memory bottlenecks. We evaluated tcFFT on a wide range of 1D and 2D FFT sizes on NVIDIA V100 and A100 GPUs. The results show that tcFFT achieves average speedups of 1.29x-3.24x and 1.10x-3.03x over NVIDIA cuFFT v11.0 in FP16 on V100 and A100, respectively.
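To illustrate why an FFT can be mapped onto matrix-multiply hardware at all, the following is a minimal sketch, not tcFFT's actual implementation. It assumes the standard Cooley-Tukey factorization N = N1 * N2, in which a length-N DFT becomes two small dense matrix products separated by an element-wise twiddle multiplication. The function names (`dft_matrix`, `matmul`, `fft_via_matmul`, `direct_dft`) and the 16 x 4 split are hypothetical choices made for illustration, and the arithmetic uses Python's double-precision complex numbers rather than FP16.

```python
import cmath


def dft_matrix(n):
    """Dense n x n DFT matrix with entries exp(-2*pi*i*r*c/n)."""
    return [[cmath.exp(-2j * cmath.pi * r * c / n) for c in range(n)]
            for r in range(n)]


def matmul(a, b):
    """Plain dense matrix product; stands in for a Tensor Core GEMM."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(inner))
             for j in range(cols)] for i in range(rows)]


def fft_via_matmul(x, n1, n2):
    """Length n1*n2 DFT as two matrix products plus twiddle factors."""
    n = n1 * n2
    assert len(x) == n
    # Reshape the input row-major into an n1 x n2 matrix.
    a = [[x[n2 * r + c] for c in range(n2)] for r in range(n1)]
    # Step 1: length-n1 DFTs down every column, as one matrix product.
    b = matmul(dft_matrix(n1), a)
    # Step 2: element-wise twiddle factors exp(-2*pi*i*k1*j/n).
    c = [[b[k1][j] * cmath.exp(-2j * cmath.pi * k1 * j / n)
          for j in range(n2)] for k1 in range(n1)]
    # Step 3: length-n2 DFTs along every row, as one matrix product.
    d = matmul(c, dft_matrix(n2))
    # Output index is k1 + n1*k2, i.e. the result is read out transposed.
    out = [0j] * n
    for k1 in range(n1):
        for k2 in range(n2):
            out[k1 + n1 * k2] = d[k1][k2]
    return out


def direct_dft(x):
    """Reference O(n^2) DFT used only to check the sketch."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * m * k / n)
                for m in range(n)) for k in range(n)]


signal = [complex(i % 5, -(i % 3)) for i in range(64)]
max_err = max(abs(p - q) for p, q in
              zip(fft_via_matmul(signal, 16, 4), direct_dft(signal)))
```

Here `max_err` measures the largest difference between the matrix-product formulation and the direct DFT for a 64-point input. On a GPU, the two `matmul` calls are the parts that could in principle be handed to Tensor Cores, while the twiddle step and the transposed read-out are the parts that stress memory bandwidth.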
Pages: 1-11
Page count: 11