Mechanical derivation of fused multiply-add algorithms for linear transforms

Cited by: 5
Authors
Voronenko, Yevgen [1 ]
Pueschel, Markus [1 ]
Affiliation
[1] Carnegie Mellon Univ, Dept Elect & Comp Engn, Pittsburgh, PA 15213 USA
Funding
U.S. National Science Foundation;
Keywords
automatic program generation; discrete cosine transform (DCT); discrete Fourier transform (DFT); fast algorithm; implementation; multiply-and-accumulate (MAC) instruction;
DOI
10.1109/TSP.2007.896116
Chinese Library Classification (CLC)
TM [Electrical Technology]; TN [Electronic Technology, Communication Technology];
Discipline Codes
0808; 0809;
Abstract
Several computer architectures offer fused multiply-add (FMA) instructions, also called multiply-and-accumulate (MAC) instructions, that are as fast as a single addition or multiplication. For the efficient implementation of linear transforms, such as the discrete Fourier transform or discrete cosine transforms, this poses a challenge to algorithm developers, as standard transform algorithms have to be manipulated into FMA algorithms that make optimal use of FMA instructions. We present a general method to convert any transform algorithm into an FMA algorithm. The method works with both algorithms given as directed acyclic graphs (DAGs) and algorithms given as structured matrix factorizations. We prove bounds on the efficiency of the method. In particular, we show that it removes all single multiplications except at most as many as the transform has outputs. We implemented the DAG-based version of the method and show that we can generate many of the best-known hand-derived FMA algorithms from the literature as well as a few novel FMA algorithms.
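The kind of rewriting the abstract refers to can be illustrated on the radix-2 FFT butterfly with a real twiddle factor. The sketch below is a minimal, hand-written illustration of the idea, not the paper's DAG-based generator; the function names and the use of the C99 fma() call are assumptions made here for the example.

```c
/* Minimal sketch (assumption: illustrative only, not the paper's method):
 * absorbing the multiplication of a radix-2 butterfly into FMA instructions
 * using the C99 fma() function from <math.h>. */
#include <math.h>
#include <stdio.h>

/* Standard butterfly: one multiplication plus two additions (3 instructions). */
static void butterfly_plain(double w, double x0, double x1,
                            double *y0, double *y1)
{
    double t = w * x1;   /* isolated single multiplication */
    *y0 = x0 + t;
    *y1 = x0 - t;
}

/* FMA form: the multiplication is fused into both additions,
 * leaving two FMA instructions and no isolated multiplication. */
static void butterfly_fma(double w, double x0, double x1,
                          double *y0, double *y1)
{
    *y0 = fma( w, x1, x0);   /* computes x0 + w*x1 */
    *y1 = fma(-w, x1, x0);   /* computes x0 - w*x1 */
}

int main(void)
{
    double a, b, c, d;
    butterfly_plain(0.75, 1.0, 2.0, &a, &b);
    butterfly_fma  (0.75, 1.0, 2.0, &c, &d);
    printf("plain: %f %f\nfma:   %f %f\n", a, b, c, d);
    return 0;
}
```

In the plain version the single multiplication w*x1 occupies an instruction of its own; in the FMA form it is absorbed into both additions, which is the kind of saving the abstract says the method achieves automatically and with provable bounds.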
Pages: 4458-4473
Number of pages: 16