SIMULATING LOW PRECISION FLOATING-POINT ARITHMETIC

Cited by: 38
|
Authors
Higham, Nicholas J. [1 ]
Pranesh, Srikara [1 ]
Affiliation
[1] Univ Manchester, Sch Math, Manchester M13 9PL, Lancs, England
Source
SIAM JOURNAL ON SCIENTIFIC COMPUTING | 2019, Vol. 41, No. 05
Funding
Engineering and Physical Sciences Research Council (EPSRC), UK;
Keywords
floating-point arithmetic; half precision; low precision; IEEE arithmetic; fp16; bfloat16; subnormal numbers; mixed precision; simulation; rounding error analysis; round to nearest; directed rounding; stochastic rounding; bit flips; MATLAB; ITERATIVE REFINEMENT; ACCURACY;
DOI
10.1137/19M1251308
Chinese Library Classification
O29 [Applied Mathematics];
Discipline Code
070104;
Abstract
The half-precision (fp16) floating-point format, defined in the 2008 revision of the IEEE standard for floating-point arithmetic, and a more recently proposed half-precision format, bfloat16, are increasingly available in GPUs and other accelerators. While the support for low precision arithmetic is mainly motivated by machine learning applications, general purpose numerical algorithms can benefit from it too, gaining speed and reducing energy usage and communication costs. Since the appropriate hardware is not always available, and one may wish to experiment with new arithmetics not yet implemented in hardware, software simulations of low precision arithmetic are needed. We discuss how to simulate low precision arithmetic using arithmetic of higher precision. We examine the correctness of such simulations and explain via rounding error analysis why a natural method of simulation can provide results that are more accurate than actual computations at low precision. We provide a MATLAB function, chop, that can be used to efficiently simulate fp16, bfloat16, and other low precision arithmetics, with or without the representation of subnormal numbers and with the options of round to nearest, directed rounding, stochastic rounding, and random bit flips in the significand. We demonstrate the advantages of this approach over defining a new MATLAB class and overloading operators.
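The core idea of the paper, rounding each result computed in higher precision to a target format's significand length and exponent range, can be illustrated with a minimal Python sketch. This is not the authors' MATLAB chop function: the function name `chop_sketch`, the default parameters, and the simplified handling (no subnormals, round-to-nearest-even only) are assumptions made for illustration.

```python
import math

def chop_sketch(x, t=11, emax=15):
    """Round the double x to t significand bits (round-to-nearest-even)
    and limit the exponent, mimicking an fp16-like format.
    t=11 and emax=15 match IEEE fp16; subnormals are ignored here."""
    if x == 0.0 or math.isnan(x) or math.isinf(x):
        return x
    m, e = math.frexp(x)              # x = m * 2**e with 0.5 <= |m| < 1
    if e - 1 > emax:                  # exponent too large: overflow
        return math.copysign(math.inf, x)
    # Scale so the significand becomes a t-bit integer, then round.
    # Python's round() rounds halfway cases to even, as IEEE requires.
    m_rounded = round(m * 2**t)
    return math.ldexp(m_rounded, e - t)
```

To simulate a whole computation at low precision, one rounds the operands and the result of every operation, e.g. `chop_sketch(chop_sketch(a) + chop_sketch(b))`; for example `chop_sketch(1/3)` yields the fp16 value 0.333251953125.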
Pages: C585 - C602
Page count: 18