An FPGA Overlay for CNN Inference with Fine-grained Flexible Parallelism

Cited by: 5
Authors:
Choudhury, Ziaul [1 ]
Shrivastava, Shashwat [1 ]
Ramapantulu, Lavanya [1 ]
Purini, Suresh [1 ]
Affiliations:
[1] Int Inst Informat Technol Hyderabad, Hyderabad, Telangana, India
Keywords:
FPGAs; convolutional neural networks; accelerators; COMPILING WINDOW OPERATIONS; NEURAL-NETWORKS; ACCELERATOR;
DOI:
10.1145/3519598
CLC Number: TP3 [Computing technology, computer technology]
Subject Classification Code: 0812
Abstract:
Increasingly, pre-trained convolutional neural networks (CNNs) are being deployed for inference in various computer vision applications, both on the server side in data centers and at the edge. CNN inference is a very compute-intensive task, and it is a challenge to meet performance metrics such as latency and throughput while optimizing power. Special-purpose ASICs and FPGAs are suitable candidates for meeting these power and performance budgets simultaneously. Rapidly evolving CNN architectures involve novel convolution operations such as pointwise convolutions, depthwise separable convolutions, and so on. This leads to substantial variation in computational structure across CNNs and across layers within a CNN. Because of this, FPGA reconfigurability provides an attractive tradeoff compared to ASICs. FPGA-based hardware designers address this structural variability by generating a network-specific accelerator for a single network or a class of networks. Homogeneous accelerators, in contrast, are network agnostic and often sacrifice throughput and FPGA LUTs for flexibility. In this article, we propose an FPGA overlay for efficient processing of CNNs that can be scaled based on the available compute and memory resources of the FPGA. The overlay is configured on the fly through control words sent by the host on a per-layer basis. Unlike current overlays, our architecture exploits all forms of parallelism inside a convolution operation. A constraint system is employed at the host end to determine the per-layer configuration of the overlay that uses all forms of parallelism in processing the layer, yielding the highest throughput for that layer. We studied the effectiveness of our overlay by using it to process the AlexNet, VGG16, YOLO, MobileNet, and ResNet-50 CNNs, targeting a Virtex-7 and a larger UltraScale+ VU9P FPGA. The chosen CNNs mix different types of convolution layers and filter sizes, presenting good variation in model size and structure.
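The abstract does not spell out the constraint system's formulation. As a hypothetical illustration of the idea only, the sketch below brute-forces per-layer unroll factors along the output-channel, input-channel, and filter-window dimensions (the dimension names and the multiplier budget are assumptions, not the paper's actual model), keeping their product within the overlay's compute budget while maximizing utilized multipliers as a proxy for layer throughput:

```python
# Hypothetical sketch of per-layer configuration search for a CNN overlay.
# Dimension names (output channels, input channels, filter window) and the
# multiplier budget are illustrative assumptions, not the paper's formulation.
from itertools import product


def divisors(n):
    """All positive divisors of n (valid unroll factors for a dimension of size n)."""
    return [d for d in range(1, n + 1) if n % d == 0]


def best_config(out_ch, in_ch, k, pe_budget):
    """Brute-force the unroll factors (po, pi, pk) whose product fits the
    multiplier budget and maximizes utilized multipliers."""
    best_used, best_cfg = 0, None
    for po, pi, pk in product(divisors(out_ch), divisors(in_ch), divisors(k * k)):
        used = po * pi * pk
        if used <= pe_budget and used > best_used:
            best_used, best_cfg = used, (po, pi, pk)
    return best_used, best_cfg


# e.g. a 3x3 layer with 64 output and 32 input channels on a 512-multiplier overlay
used, cfg = best_config(64, 32, 3, 512)
```

In the paper this choice is made per layer by a constraint system on the host; the brute-force search here only conveys the shape of the optimization, not its actual method or variables.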
Our accelerator achieved a maximum throughput of 1,200 GOps/second on the Virtex-7, an improvement of 1.2x to 5x over recent designs. The reported performance density, measured in giga operations per second per KLUT, is a 1.3x to 4x improvement over existing works. Similar speed-ups and performance densities are also observed on the UltraScale+ VU9P FPGA.
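As a quick arithmetic illustration of the performance-density metric (GOps/second per KLUT) used above, with a purely hypothetical LUT count not taken from the paper:

```python
# Performance density = throughput / LUTs used (in KLUTs).
# The throughput is the figure reported above; the LUT count is an
# assumed, illustrative value, not a number from the paper.
throughput_gops = 1200.0   # reported peak throughput on the Virtex-7 (GOps/s)
kluts_used = 400.0         # hypothetical resource usage, for illustration
density_gops_per_klut = throughput_gops / kluts_used  # = 3.0 GOps/s per KLUT
```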
Pages: 26
Related Papers (50 items total)
  • [31] Testing fine-grained parallelism for the ADMM on a factor-graph
    Hao, Ning
    Oghbaee, AmirReza
    Rostami, Mohammad
    Derbinsky, Nate
    Bento, Jose
    [J]. 2016 IEEE 30TH INTERNATIONAL PARALLEL AND DISTRIBUTED PROCESSING SYMPOSIUM WORKSHOPS (IPDPSW), 2016, : 835 - 844
  • [32] Bilinear CNN Models for Fine-grained Visual Recognition
    Lin, Tsung-Yu
    RoyChowdhury, Aruni
    Maji, Subhransu
    [J]. 2015 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV), 2015, : 1449 - 1457
  • [33] Fast Attention CNN for Fine-Grained Crack Segmentation
    Lee, Hyunnam
    Yoo, Juhan
    [J]. SENSORS, 2023, 23 (04)
  • [34] Accelerating CNN Algorithm with Fine-grained Dataflow Architectures
    Xiang, Taoran
    Feng, Yujing
    Ye, Xiaochun
    Tan, Xu
    Li, Wenming
    Zhu, Yatao
    Wu, Meng
    Zhang, Hao
    Fan, Dongrui
    [J]. IEEE 20TH INTERNATIONAL CONFERENCE ON HIGH PERFORMANCE COMPUTING AND COMMUNICATIONS / IEEE 16TH INTERNATIONAL CONFERENCE ON SMART CITY / IEEE 4TH INTERNATIONAL CONFERENCE ON DATA SCIENCE AND SYSTEMS (HPCC/SMARTCITY/DSS), 2018, : 243 - 251
  • [35] Fine-Grained Crowdsourcing for Fine-Grained Recognition
    Jia Deng
    Krause, Jonathan
    Li Fei-Fei
    [J]. 2013 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2013, : 580 - 587
  • [36] BALANCING FINE-GRAINED AND MEDIUM-GRAINED PARALLELISM IN SCHEDULING LOOPS FOR THE XIMD ARCHITECTURE
    NEWBURN, CJ
    HUANG, AS
    SHEN, JP
    [J]. IFIP TRANSACTIONS A-COMPUTER SCIENCE AND TECHNOLOGY, 1993, 23 : 39 - 52
  • [37] Fine-Grained Defect Diagnosis for CMOL FPGA Circuits
    Kim, Jihye
    Lee, Hayoung
    Jang, Seokjun
    Kang, Sungho
    [J]. IEEE ACCESS, 2020, 8 (08): : 163140 - 163151
  • [38] Fine-Grained Urban Flow Inference With Incomplete Data
    Li, Jiyue
    Wang, Senzhang
    Zhang, Jiaqiang
    Miao, Hao
    Zhang, Junbo
    Yu, Philip S.
    [J]. IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2023, 35 (06) : 5851 - 5864
  • [39] A Systematic Evaluation: Fine-Grained CNN vs. Traditional CNN Classifiers
    Anwar, Saeed
    Barnes, Nick
    Petersson, Lars
    [J]. ELECTRONICS, 2023, 12 (23)
  • [40] Towards Fine-grained Parallelism in Parallel and Distributed Python Libraries
    Kerney, Jamison
    Raicu, Joan
    Raicu, John
    Chard, Kyle
    [J]. 2024 IEEE INTERNATIONAL PARALLEL AND DISTRIBUTED PROCESSING SYMPOSIUM WORKSHOPS, IPDPSW 2024, 2024, : 706 - 715