Accurate Low-Bit Length Floating-Point Arithmetic with Sorting Numbers

Authors
Dehghanpour, Alireza [1]
Kordestani, Javad Khodamoradi [1]
Dehyadegari, Masoud [1,2]
Affiliations
[1] K N Toosi Univ Technol, Fac Comp Engn, Tehran 1631714191, Iran
[2] Inst Res Fundamental Sci IPM, Sch Comp Sci, 193955746, Tehran, Iran
Keywords
Deep neural networks; Floating point; Sorting; AlexNet; Convolutional neural networks
DOI
10.1007/s11063-023-11409-8
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
A 32-bit floating-point format is often used for the development and training of deep neural networks. Training and inference with codecs optimized for deep learning can yield large gains in performance and energy efficiency. However, training low-bit neural networks and running inference with them remain a significant challenge. In this study, we propose a sorting method that maintains accuracy in numerical formats with a low number of bits. We tested this method on convolutional neural networks, including AlexNet. With our method, the accuracy achieved with 11 bits in our convolutional neural network matches that of the IEEE 32-bit format. Similarly, in AlexNet, the accuracy achieved with 10 bits matches that of the IEEE 32-bit format. These results suggest that the sorting method shows promise for limited-precision calculations.
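The abstract does not spell out how the sorting method works, so the following is only a minimal sketch of one common reading: ordering the addends by magnitude before accumulating them in a short significand, so that small terms are not absorbed by a large running total. The helper names (`round_to_bits`, `low_precision_sum`, `sorted_low_precision_sum`) and the 11-bit significand are illustrative assumptions, not the authors' format or implementation, and the exponent range is left unlimited for simplicity.

```python
import math


def round_to_bits(x: float, precision: int = 11) -> float:
    """Round x to `precision` significand bits (assumed format; exponent range not limited)."""
    if x == 0.0 or not math.isfinite(x):
        return x
    # x = mantissa * 2**exponent with 0.5 <= |mantissa| < 1
    mantissa, exponent = math.frexp(x)
    scale = 2 ** precision
    # Python's round() uses round-half-to-even, like IEEE default rounding.
    return math.ldexp(round(mantissa * scale) / scale, exponent)


def low_precision_sum(values, precision: int = 11) -> float:
    """Accumulate left to right, rounding after every addition."""
    total = 0.0
    for v in values:
        total = round_to_bits(total + round_to_bits(v, precision), precision)
    return total


def sorted_low_precision_sum(values, precision: int = 11) -> float:
    """Same accumulation, but with addends sorted by increasing magnitude (assumed strategy)."""
    return low_precision_sum(sorted(values, key=abs), precision)


# One large term followed by many small ones; the exact sum is 4096.
values = [2048.0] + [1.0] * 2048
naive = low_precision_sum(values)
ordered = sorted_low_precision_sum(values)
print(naive, ordered)
```

In this toy case the unsorted accumulation stalls at 2048.0, because 2048 + 1 is not representable with 11 significand bits and rounds back to 2048, while the sorted accumulation reaches the exact value 4096.0. Real network workloads mix signs and magnitudes, so the benefit there would have to be measured rather than assumed.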
Pages: 12061-12078
Page count: 18