Empowering edge devices: FPGA-based 16-bit fixed-point accelerator with SVD for CNN on 32-bit memory-limited systems

Cited by: 1
Authors
Yanamala, Rama Muni Reddy [1 ]
Pullakandam, Muralidhar [1 ]
Affiliations
[1] Natl Inst Technol Warangal, Dept ECE, Warangal, Telangana, India
Keywords
convolutional neural network; hardware accelerator; high-level synthesis; IP integration; PYNQ-Z2; SVD; convolutional neural networks
DOI
10.1002/cta.3957
Chinese Library Classification
TM [Electrical Technology]; TN [Electronic Technology, Communication Technology]
Subject Classification Code
0808; 0809
Abstract
Convolutional neural networks (CNNs) are now widely used in deep learning and computer vision applications. Their convolutional layers account for most of the computation and should be executed quickly on a local edge device. Field-programmable gate arrays (FPGAs) have been extensively explored as promising hardware accelerators for CNNs owing to their high performance, energy efficiency, and reconfigurability. This paper develops an efficient FPGA-based 16-bit fixed-point hardware accelerator unit for deep learning applications on a 32-bit memory-limited edge device (the PYNQ-Z2 board). Additionally, singular value decomposition (SVD) is applied to the fully connected layer to reduce the dimensionality of its weight parameters. The accelerator unit was designed for all five layers and employs eight processing elements in convolution layers 1 and 2 for parallel computation. Array partitioning, loop unrolling, and pipelining are used to further speed up the calculations, and the AXI-Lite interface handles communication between the IP and the other blocks. The design is tested on grayscale image classification with the MNIST handwritten digit dataset and on color image classification with the Tumor dataset. The experimental results show that the proposed accelerator unit outperforms software-based implementations: its inference is 89.03% faster than an Intel 3-core CPU, 86.12% faster than a Haswell 2-core CPU, and 82.45% faster than an NVIDIA Tesla K80 GPU. Furthermore, the proposed design achieves a throughput of 4.33 GOP/s, which is better than conventional CNN accelerator architectures.
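The abstract describes two techniques worth sketching: compressing the fully connected layer's weight matrix via a truncated SVD, and representing values in 16-bit fixed point. The following is a minimal NumPy sketch, not the authors' code; the Q8.8 format, the layer size, and the chosen rank are illustrative assumptions (the paper's exact fixed-point format and rank are not stated in this record).

```python
# Hedged sketch (not the authors' implementation): low-rank SVD
# compression of a fully connected layer's weight matrix, plus 16-bit
# fixed-point quantization. Q8.8 (8 fractional bits) is an assumption.
import numpy as np

def svd_compress(W, rank):
    """Factor W (m x n) into U_r (m x r) and V_r (r x n), keeping the
    top-`rank` singular values. One m x n matmul becomes two smaller
    ones, storing m*r + r*n parameters instead of m*n."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    U_r = U[:, :rank] * s[:rank]   # fold singular values into U
    V_r = Vt[:rank, :]
    return U_r, V_r

def to_q8_8(x):
    """Quantize to 16-bit fixed point with 8 fractional bits (Q8.8),
    saturating to the int16 range."""
    return np.clip(np.round(x * 256.0), -32768, 32767).astype(np.int16)

rng = np.random.default_rng(0)
W = rng.standard_normal((120, 84))      # e.g., a LeNet-style FC layer
U_r, V_r = svd_compress(W, rank=20)
print(U_r.shape, V_r.shape)             # (120, 20) (20, 84)
print(U_r.size + V_r.size, "<", W.size) # 4080 < 10080 parameters
```

With rank 20, the factored layer stores roughly 40% of the original parameters, at the cost of an approximation error controlled by the discarded singular values; the quantized `int16` factors are what a fixed-point accelerator would hold in on-chip memory.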
Pages: 4755-4782
Page count: 28
Related papers
  • [1] Design of 16-bit fixed-point CNN coprocessor based on FPGA
    Liang, Feng
    Yang, Yichen
    Zhang, Guohe
    Zhang, Xueliang
    Wu, Bin
    [J]. 2018 IEEE 23RD INTERNATIONAL CONFERENCE ON DIGITAL SIGNAL PROCESSING (DSP), 2018,
  • [2] An Approach for Matrix Multiplication of 32-Bit Fixed Point Numbers by Means of 16-Bit SIMD Instructions on DSP
    Safonov, Ilia
    Kornilov, Anton
    Makienko, Daria
    [J]. ELECTRONICS, 2023, 12 (01)
  • [3] FPGA-Based High-Speed Energy-Efficient 32-Bit Fixed-Point MAC Architecture for DSP Application in IoT Edge Computing
    Nagar, Mitul Sudhirkumar
    Patel, Sohan H.
    Engineer, Pinalkumar
    [J]. JOURNAL OF CIRCUITS SYSTEMS AND COMPUTERS, 2024, 33 (14)
  • [4] ANALOG DEVICES ADSP-2100 FAMILY 16-BIT FIXED-POINT DSP
    [Anonymous]
    [J]. EDN, 1995, 40 (10) : 47 - 47
  • [5] An improved audio encoding architecture based on 16-bit fixed-point DSP
    Wang, X
    Dou, WB
    Hou, ZR
    [J]. 2002 INTERNATIONAL CONFERENCE ON COMMUNICATIONS, CIRCUITS AND SYSTEMS AND WEST SINO EXPOSITION PROCEEDINGS, VOLS 1-4, 2002, : 918 - 921
  • [6] Memory and computationally efficient psychoacoustic model for MPEG AAC on 16-bit fixed-point processors
    Huang, SW
    Chen, LG
    Tsai, TH
    [J]. 2005 IEEE INTERNATIONAL SYMPOSIUM ON CIRCUITS AND SYSTEMS (ISCAS), VOLS 1-6, CONFERENCE PROCEEDINGS, 2005, : 3155 - 3158
  • [7] Design of a high-speed FPGA-based 32-bit floating-point FFT processor
    Mou, Shengmei
    Yang, Xiaodong
    [J]. SNPD 2007: EIGHTH ACIS INTERNATIONAL CONFERENCE ON SOFTWARE ENGINEERING, ARTIFICIAL INTELLIGENCE, NETWORKING, AND PARALLEL/DISTRIBUTED COMPUTING, VOL 1, PROCEEDINGS, 2007, : 84 - +
  • [8] A low cost embedded mandarin Speech Recognition system based on 16-bit fixed-point DSP
    He, Q
    [J]. ICCC2004: Proceedings of the 16th International Conference on Computer Communication, Vol 1 and 2, 2004, : 1203 - 1206
  • [9] Tradeoff Between Complexity and Memory Size in the 3GPP Enhanced aacPlus Decoder: Speed-Conscious and Memory-Conscious Decoders on a 16-Bit Fixed-Point DSP
    Shimada, Osamu
    Nomura, Toshiyuki
    Sugiyama, Akihiko
    Serizawa, Masahiro
    [J]. JOURNAL OF SIGNAL PROCESSING SYSTEMS FOR SIGNAL IMAGE AND VIDEO TECHNOLOGY, 2009, 57 (03): : 297 - 303