Empowering edge devices: FPGA-based 16-bit fixed-point accelerator with SVD for CNN on 32-bit memory-limited systems

Cited by: 1
Authors
Yanamala, Rama Muni Reddy [1 ]
Pullakandam, Muralidhar [1 ]
Affiliations
[1] Natl Inst Technol Warangal, Dept ECE, Warangal, Telangana, India
Keywords
convolutional neural network; hardware accelerator; high-level synthesis; IP integration; PYNQ-Z2; SVD; convolutional neural networks
DOI
10.1002/cta.3957
Chinese Library Classification
TM [Electrical Technology]; TN [Electronic Technology, Communication Technology]
Subject Classification Code
0808; 0809
Abstract
Convolutional neural networks (CNNs) are now widely used in deep learning and computer vision applications. Their convolutional layers account for most of the computation and should be executed quickly on a local edge device. Field-programmable gate arrays (FPGAs) have been extensively explored as promising hardware accelerators for CNNs owing to their high performance, energy efficiency, and reconfigurability. This paper develops an efficient FPGA-based 16-bit fixed-point hardware accelerator unit for deep learning applications on a 32-bit memory-limited edge device (the PYNQ-Z2 board). Additionally, singular value decomposition (SVD) is applied to the fully connected layer to reduce the dimensionality of its weight parameters. The accelerator unit was designed for all five layers and employs eight processing elements in convolution layers 1 and 2 for parallel computation. Array partitioning, loop unrolling, and pipelining are used to further speed up the calculations, and the AXI-Lite interface handles communication between the IP and the other blocks. The design is tested on grayscale image classification with the MNIST handwritten digit dataset and on color image classification with the Tumor dataset. The experimental results show that the proposed accelerator unit outperforms software-based implementations: its inference is 89.03% faster than an Intel 3-core CPU, 86.12% faster than a Haswell 2-core CPU, and 82.45% faster than an NVIDIA Tesla K80 GPU. Furthermore, the proposed design achieves a throughput of 4.33 GOP/s, which is better than conventional CNN accelerator architectures.
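The abstract describes two techniques worth sketching: compressing the fully connected layer's weight matrix via a truncated SVD, and representing values in 16-bit fixed point. The following is a minimal NumPy sketch, not the authors' code; the Q8.8 format, the layer size, and the chosen rank are illustrative assumptions (the paper's exact fixed-point format and rank are not stated in this record).

```python
# Hedged sketch (not the authors' implementation): low-rank SVD
# compression of a fully connected layer's weight matrix, plus 16-bit
# fixed-point quantization. Q8.8 (8 fractional bits) is an assumption.
import numpy as np

def svd_compress(W, rank):
    """Factor W (m x n) into U_r (m x r) and V_r (r x n), keeping the
    top-`rank` singular values. One m x n matmul becomes two smaller
    ones, storing m*r + r*n parameters instead of m*n."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    U_r = U[:, :rank] * s[:rank]   # fold singular values into U
    V_r = Vt[:rank, :]
    return U_r, V_r

def to_q8_8(x):
    """Quantize to 16-bit fixed point with 8 fractional bits (Q8.8),
    saturating to the int16 range."""
    return np.clip(np.round(x * 256.0), -32768, 32767).astype(np.int16)

rng = np.random.default_rng(0)
W = rng.standard_normal((120, 84))      # e.g., a LeNet-style FC layer
U_r, V_r = svd_compress(W, rank=20)
print(U_r.shape, V_r.shape)             # (120, 20) (20, 84)
print(U_r.size + V_r.size, "<", W.size) # 4080 < 10080 parameters
```

With rank 20, the factored layer stores roughly 40% of the original parameters, at the cost of an approximation error controlled by the discarded singular values; the quantized `int16` factors are what a fixed-point accelerator would hold in on-chip memory.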
Pages: 4755-4782
Page count: 28
Related papers
  • [1] Design of 16-bit fixed-point CNN coprocessor based on FPGA
    Liang, Feng
    Yang, Yichen
    Zhang, Guohe
    Zhang, Xueliang
    Wu, Bin
    [J]. 2018 IEEE 23RD INTERNATIONAL CONFERENCE ON DIGITAL SIGNAL PROCESSING (DSP), 2018,
  • [2] An Approach for Matrix Multiplication of 32-Bit Fixed Point Numbers by Means of 16-Bit SIMD Instructions on DSP
    Safonov, Ilia
    Kornilov, Anton
    Makienko, Daria
    [J]. ELECTRONICS, 2023, 12 (01)
  • [3] FPGA-Based High-Speed Energy-Efficient 32-Bit Fixed-Point MAC Architecture for DSP Application in IoT Edge Computing
    Nagar, Mitul Sudhirkumar
    Patel, Sohan H.
    Engineer, Pinalkumar
    [J]. JOURNAL OF CIRCUITS SYSTEMS AND COMPUTERS, 2024, 33 (14)
  • [4] ANALOG DEVICES ADSP-2100 FAMILY 16-BIT FIXED-POINT DSP
    [Anonymous]
    [J]. EDN, 1995, 40 (10) : 47 - 47
  • [5] An improved audio encoding architecture based on 16-bit fixed-point DSP
    Wang, X
    Dou, WB
    Hou, ZR
    [J]. 2002 INTERNATIONAL CONFERENCE ON COMMUNICATIONS, CIRCUITS AND SYSTEMS AND WEST SINO EXPOSITION PROCEEDINGS, VOLS 1-4, 2002, : 918 - 921
  • [6] Memory and computationally efficient psychoacoustic model for MPEG AAC on 16-bit fixed-point processors
    Huang, SW
    Chen, LG
    Tsai, TH
    [J]. 2005 IEEE INTERNATIONAL SYMPOSIUM ON CIRCUITS AND SYSTEMS (ISCAS), VOLS 1-6, CONFERENCE PROCEEDINGS, 2005, : 3155 - 3158
  • [7] Design of a high-speed FPGA-based 32-bit floating-point FFT processor
    Mou, Shengmei
    Yang, Xiaodong
    [J]. SNPD 2007: EIGHTH ACIS INTERNATIONAL CONFERENCE ON SOFTWARE ENGINEERING, ARTIFICIAL INTELLIGENCE, NETWORKING, AND PARALLEL/DISTRIBUTED COMPUTING, VOL 1, PROCEEDINGS, 2007, : 84 - +
  • [8] A low cost embedded mandarin Speech Recognition system based on 16-bit fixed-point DSP
    He, Q
    [J]. ICCC2004: Proceedings of the 16th International Conference on Computer Communication, Vol 1 and 2, 2004, : 1203 - 1206
  • [9] Tradeoff Between Complexity and Memory Size in the 3GPP Enhanced aacPlus Decoder: Speed-Conscious and Memory-Conscious Decoders on a 16-Bit Fixed-Point DSP
    Shimada, Osamu
    Nomura, Toshiyuki
    Sugiyama, Akihiko
    Serizawa, Masahiro
    [J]. JOURNAL OF SIGNAL PROCESSING SYSTEMS FOR SIGNAL IMAGE AND VIDEO TECHNOLOGY, 2009, 57 (03): : 297 - 303