An FPGA Overlay for CNN Inference with Fine-grained Flexible Parallelism

Cited by: 5
Authors
Choudhury, Ziaul [1 ]
Shrivastava, Shashwat [1 ]
Ramapantulu, Lavanya [1 ]
Purini, Suresh [1 ]
Affiliations
[1] Int Inst Informat Technol Hyderabad, Hyderabad, Telangana, India
Keywords
FPGAs; convolutional neural networks; accelerators; compiling window operations; neural networks; accelerator
DOI
10.1145/3519598
CLC Number
TP3 [Computing technology; computer technology]
Discipline Code
0812
Abstract
Increasingly, pre-trained convolutional neural networks (CNNs) are being deployed for inference in various computer vision applications, both server-side in data centers and at the edge. CNN inference is a very compute-intensive task, and it is a challenge to meet performance metrics such as latency and throughput while optimizing power. Special-purpose ASICs and FPGAs are suitable candidates to meet these power and performance budgets simultaneously. Rapidly evolving CNN architectures involve novel convolution operations such as point convolutions, depthwise separable convolutions, and so on. This leads to substantial variation in the computational structure across CNNs and across layers within a CNN. Because of this variability, FPGA reconfigurability provides an attractive tradeoff compared to ASICs. FPGA-based hardware designers address the structural variability issue by generating a network-specific accelerator for a single network or a class of networks. Homogeneous accelerators, by contrast, are network agnostic but often sacrifice throughput and FPGA LUTs for flexibility. In this article, we propose an FPGA overlay for efficient processing of CNNs that can be scaled based on the available compute and memory resources of the FPGA. The overlay is configured on the fly through control words sent by the host on a per-layer basis. Unlike current overlays, our architecture exploits all forms of parallelism inside a convolution operation. A constraint system is employed at the host end to determine the per-layer configuration of the overlay that uses all forms of parallelism in the processing of the layer, resulting in the highest throughput for that layer. We studied the effectiveness of our overlay by using it to process the AlexNet, VGG16, YOLO, MobileNet, and ResNet-50 CNNs, targeting a Virtex-7 and a larger UltraScale+ VU9P FPGA. The chosen CNNs have a mix of different convolution layer types and filter sizes, presenting a good variation in model size and structure.
Our accelerator reported a maximum throughput of 1,200 GOps/second on the Virtex-7, an improvement of 1.2x to 5x over recent designs. The reported performance density, measured in giga operations per second per KLUT, is a 1.3x to 4x improvement over existing works. Similar speed-ups and performance densities are also observed for the UltraScale+ VU9P FPGA.
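The abstract describes a host-side constraint system that, for each layer, picks the combination of parallelism factors yielding the highest throughput on the overlay's fixed compute resources. The sketch below illustrates that idea with a hypothetical cost model: it brute-forces input-channel, output-channel, and pixel parallelism factors under a multiplier budget and keeps the configuration minimizing the cycle count. The function names, factor set, and cycle model are illustrative assumptions, not the authors' actual formulation.

```python
import math
from itertools import product

def conv_cycles(layer, p_in, p_out, p_pix):
    """Estimated cycles to process one conv layer when p_in input channels,
    p_out output channels, and p_pix output pixels are handled in parallel.
    (Illustrative cost model, not the paper's.)"""
    c_in, c_out, h, w, k = layer
    # Ceil on each unrolled dimension models under-utilization when a
    # parallelism factor does not evenly divide the layer dimension.
    return (math.ceil(c_in / p_in) * math.ceil(c_out / p_out)
            * math.ceil((h * w) / p_pix) * k * k)

def best_config(layer, mult_budget, factors=(1, 2, 4, 8, 16, 32)):
    """Brute-force the per-layer parallelism factors whose product fits the
    multiplier budget and minimizes the cycle count for this layer."""
    best = None
    for p_in, p_out, p_pix in product(factors, repeat=3):
        if p_in * p_out * p_pix > mult_budget:
            continue  # configuration exceeds the compute budget
        cycles = conv_cycles(layer, p_in, p_out, p_pix)
        if best is None or cycles < best[0]:
            best = (cycles, (p_in, p_out, p_pix))
    return best

# Example: a 3x3 conv layer (64 input channels, 128 output channels,
# 56x56 output map) on a budget of 512 parallel multipliers.
layer = (64, 128, 56, 56, 3)
cycles, cfg = best_config(layer, mult_budget=512)
print(cfg, cycles)
```

In this toy model the best configurations saturate the multiplier budget, which mirrors the paper's goal of keeping the overlay's compute fully utilized on every layer; the real system would solve such constraints per layer and encode the result into the control words sent to the overlay.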
Pages: 26