Mechanical derivation of fused multiply-add algorithms for linear transforms

Cited by: 5
Authors
Voronenko, Yevgen [1]
Pueschel, Markus [1]
Affiliations
[1] Carnegie Mellon Univ, Dept Elect & Comp Engn, Pittsburgh, PA 15213 USA
Funding
National Science Foundation (USA)
Keywords
automatic program generation; discrete cosine transform (DCT); discrete Fourier transform (DFT); fast algorithm; implementation; multiply-and-accumulate (MAC) instruction
DOI
10.1109/TSP.2007.896116
Chinese Library Classification
TM [Electrical Engineering]; TN [Electronics and Communication Technology]
Discipline classification codes
0808; 0809
Abstract
Several computer architectures offer fused multiply-add (FMA) instructions, also called multiply-and-accumulate (MAC) instructions, that are as fast as a single addition or multiplication. For the efficient implementation of linear transforms, such as the discrete Fourier transform or discrete cosine transforms, this poses a challenge to algorithm developers, because standard transform algorithms have to be manipulated into FMA algorithms that make optimal use of FMA instructions. We present a general method to convert any transform algorithm into an FMA algorithm. The method works both with algorithms given as directed acyclic graphs (DAGs) and with algorithms given as structured matrix factorizations. We prove bounds on the efficiency of the method. In particular, we show that it removes all single multiplications except at most as many as the transform has outputs. We implemented the DAG-based version of the method and show that we can generate many of the best-known hand-derived FMA algorithms from the literature as well as a few novel FMA algorithms.
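To make the idea of an "FMA algorithm" concrete, the sketch below rewrites a plane rotation, a building block common to many fast transform algorithms, so that its multiplications are absorbed into multiply-add operations. It is an illustrative textbook-style manipulation and is not taken from the paper. The helper `fma` and the function names are hypothetical, and the sketch assumes the cosine factor `c` is nonzero.

```python
import math

def fma(a, b, c):
    # Hypothetical stand-in for a hardware FMA instruction: returns a*b + c.
    # Plain Python rounds twice here; a real FMA rounds once.
    return a * b + c

def rotate_plain(c, s, x0, x1):
    # Standard form: 4 multiplications and 2 additions.
    return c * x0 - s * x1, s * x0 + c * x1

def rotate_fma(c, s, x0, x1):
    # Assumption: c != 0, so the constant t = s/c can be precomputed.
    t = s / c
    u0 = fma(-t, x1, x0)  # x0 - t*x1, one FMA
    u1 = fma(t, x0, x1)   # t*x0 + x1, one FMA
    # Two single multiplications remain, one per output.
    return c * u0, c * u1
```

In this form the two remaining scalings by `c` sit at the outputs, where a later stage could in principle absorb them into its own multiply-adds. This is in the spirit of the abstract's bound of at most one leftover multiplication per transform output.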
Pages: 4458-4473
Page count: 16
Related papers
50 in total
  • [1] Automatic generation of implementations for DSP transforms on fused multiply-add architectures
    Voronenko, Y
    Püschel, M
    2004 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOL V, PROCEEDINGS: DESIGN AND IMPLEMENTATION OF SIGNAL PROCESSING SYSTEMS INDUSTRY TECHNOLOGY TRACKS MACHINE LEARNING FOR SIGNAL PROCESSING MULTIMEDIA SIGNAL PROCESSING SIGNAL PROCESSING FOR EDUCATION, 2004, : 101 - 104
  • [2] Exhaustive Testing of Fused Multiply-Add RTL
    Burgess, Neil
    Lutz, David R.
    2013 ASILOMAR CONFERENCE ON SIGNALS, SYSTEMS AND COMPUTERS, 2013, : 405 - 406
  • [3] MODIFIED FFTS FOR FUSED MULTIPLY-ADD ARCHITECTURES
    LINZER, E
    FEIG, E
    MATHEMATICS OF COMPUTATION, 1993, 60 (201) : 347 - 361
  • [4] Floating-point fused multiply-add architectures
    Quinnell, Eric
    Swartzlander, Earl E., Jr.
    Lemonds, Carl
    CONFERENCE RECORD OF THE FORTY-FIRST ASILOMAR CONFERENCE ON SIGNALS, SYSTEMS & COMPUTERS, VOLS 1-5, 2007, : 331 - +
  • [5] Formally Verified Argument Reduction with a Fused Multiply-Add
    Boldo, Sylvie
    Daumas, Marc
    Li, Ren-Cang
    IEEE TRANSACTIONS ON COMPUTERS, 2009, 58 (08) : 1139 - 1145
  • [6] Fused Multiply-Add Microarchitecture Comprising Separate Early-Normalizing Multiply and Add Pipelines
    Lutz, David R.
    2011 20TH IEEE SYMPOSIUM ON COMPUTER ARITHMETIC (ARITH-20), 2011, : 123 - 128
  • [7] Fused Multiply-Add for Variable Precision Floating-Point
    Nannarelli, Alberto
    32ND IEEE INTERNATIONAL SYSTEM ON CHIP CONFERENCE (IEEE SOCC 2019), 2019, : 342 - 347
  • [8] Floating-point fused multiply-add with reduced latency
    Lang, T
    Bruguera, JD
    ICCD'2002: IEEE INTERNATIONAL CONFERENCE ON COMPUTER DESIGN: VLSI IN COMPUTERS AND PROCESSORS, PROCEEDINGS, 2002, : 145 - 150
  • [9] Bridge Floating-Point Fused Multiply-Add Design
    Quinnell, Eric
    Swartzlander, Earl E., Jr.
    Lemonds, Carl
    IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, 2008, 16 (12) : 1726 - 1730
  • [10] Floating-Point Fused Multiply-Add under HUB Format
    Hormigo, Javier
    Villalba-Moreno, Julio
    Gonzalez-Navarro, Sonia
    2020 IEEE 27TH SYMPOSIUM ON COMPUTER ARITHMETIC (ARITH), 2020, : 1 - 8