High-performance Reconfigurable DNN Accelerator on a Bandwidth-limited Embedded System

Cited by: 2
Authors
Hu, Xianghong [1]
Huang, Hongmin [1]
Li, Xueming [1]
Zheng, Xin [1]
Ren, Qinyuan [2]
He, Jingyu [3]
Xiong, Xiaoming [1]
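Affiliations
[1] Guangdong University of Technology, School of Microelectronics, Guangzhou 510006, Guangdong, People's Republic of China
[2] Zhejiang University, College of Control Science and Engineering, Hangzhou, People's Republic of China
[3] Hong Kong University of Science and Technology, Department of Electronic and Computer Engineering, Hong Kong 999077, People's Republic of China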
Keywords
Convolutional neural networks; reconfigurable; accelerator; real-time object detection system; design space exploration; neural network; hardware accelerator
DOI
10.1145/3530818
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Deep convolutional neural networks (DNNs) have been widely used in many applications, particularly in machine vision. It is challenging to accelerate DNNs on embedded systems because real-world machine vision applications must reserve much of the external memory bandwidth for other tasks, such as video capture and display, leaving little bandwidth for DNN acceleration. To address this, we propose a high-throughput accelerator, the reconfigurable tiny neural network accelerator (ReTiNNA), for bandwidth-limited systems and present a real-time object detection system for high-resolution video. We first present a dedicated computation engine that applies different data-mapping methods to different filter types to improve data reuse and reduce hardware resources. We then propose an adaptive layer-wise tiling strategy that tiles the feature maps into strips, which dramatically reduces the control complexity of data transmission and improves its efficiency. Finally, we present a design space exploration (DSE) approach that explores the design space more accurately when bandwidth is insufficient, improving the performance of the low-bandwidth accelerator. With a low bandwidth of 2.23 GB/s and low hardware consumption of 90.261K LUTs and 448 DSPs, ReTiNNA still achieves 155.86 GOPS on VGG16 and 68.20 GOPS on ResNet50, which is better than other state-of-the-art designs implemented on FPGA devices. Furthermore, the real-time object detection system achieves a detection speed of 19 fps for high-resolution video.
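To illustrate why limited external bandwidth constrains accelerator throughput, the following is a minimal sketch of the generic roofline bound, under which attainable performance is the lesser of the compute peak and bandwidth times operational intensity. It is not the DSE method of the paper. Only the 2.23 GB/s figure comes from the abstract; the function name `attainable_gops`, the peak value, and the operational intensities are hypothetical values chosen for illustration.

```python
def attainable_gops(peak_gops: float, bandwidth_gb_s: float, ops_per_byte: float) -> float:
    """Roofline bound: the lower of the compute roof and the bandwidth roof."""
    bandwidth_roof = bandwidth_gb_s * ops_per_byte  # GB/s * ops/byte = GOPS
    return min(peak_gops, bandwidth_roof)


BANDWIDTH_GB_S = 2.23      # external memory bandwidth quoted in the abstract
PEAK_GOPS = 200.0          # hypothetical compute peak, not taken from the paper

if __name__ == "__main__":
    # Hypothetical operational intensities (operations per byte of off-chip traffic).
    for intensity in (10.0, 50.0, 100.0):
        bound = attainable_gops(PEAK_GOPS, BANDWIDTH_GB_S, intensity)
        limiter = "bandwidth" if bound < PEAK_GOPS else "compute"
        print(f"intensity={intensity:6.1f} ops/byte -> bound={bound:7.2f} GOPS ({limiter}-limited)")
```

Under these assumed numbers, a layer only reaches the compute roof once its data reuse is high enough that bandwidth times intensity exceeds the peak, which is why techniques that raise data reuse matter most when bandwidth is scarce.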
Pages: 20