Accurate Low-Bit Length Floating-Point Arithmetic with Sorting Numbers

Cited by: 1

Authors:

Dehghanpour, Alireza [1]

Kordestani, Javad Khodamoradi [1]

Dehyadegari, Masoud [1,2]

Affiliations:

[1] K N Toosi Univ Technol, Fac Comp Engn, Tehran 1631714191, Iran

[2] Inst Res Fundamental Sci IPM, Sch Comp Sci, 193955746, Tehran, Iran

Source:

NEURAL PROCESSING LETTERS | 2023 / Vol. 55 / Issue 09

Keywords:

Deep neural networks; Floating point; Sorting; AlexNet; Convolutional neural networks;

DOI:

10.1007/s11063-023-11409-8

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Subject classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

A 32-bit floating-point format is often used for the development and training of deep neural networks. Training and inference in deep learning-optimized codecs can result in enormous performance and energy efficiency advantages. However, training and inferring low-bit neural networks still pose a significant challenge. In this study, we propose a sorting method that maintains accuracy in numerical formats with a low number of bits. We tested this method on convolutional neural networks, including AlexNet. Using our method, we found that in our convolutional neural network, the accuracy achieved with 11 bits matches that of the IEEE 32-bit format. Similarly, in AlexNet, the accuracy achieved with 10 bits matches that of the IEEE 32-bit format. These results suggest that the sorting method shows promise for calculations with limited accuracy.
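The abstract does not say how the numbers are sorted or where in the network the sorting is applied, so the following is only a generic sketch of why summation order can matter at low precision, not the authors' method. It assumes NumPy's `float16` as a stand-in for a low-bit format and an ascending-magnitude order; the function names `accumulate_fp16` and `sorted_accumulate_fp16` are hypothetical.

```python
import numpy as np

def accumulate_fp16(values):
    """Sum values left to right, rounding to float16 after every addition."""
    total = np.float16(0.0)
    for v in values:
        total = np.float16(total + np.float16(v))
    return float(total)

def sorted_accumulate_fp16(values):
    """Sum the same values after ordering them by increasing magnitude.

    Assumption: ascending-magnitude order is used for illustration only;
    the paper's actual ordering is not stated in the abstract.
    """
    return accumulate_fp16(sorted(values, key=abs))

# One large term followed by many small ones; the exact sum is 2148.
values = [2048.0] + [1.0] * 100

naive = accumulate_fp16(values)              # small terms are absorbed by 2048
sorted_sum = sorted_accumulate_fp16(values)  # small terms accumulate first

print(naive, sorted_sum)
```

In this example the spacing between adjacent `float16` values near 2048 is 2, so adding 1.0 to a running total of 2048 rounds back to 2048 each time. Adding the small terms first lets them reach 100 before they meet the large term, and 2148 is representable.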


Pages: 12061-12078

Number of pages: 18

