An FPGA Overlay for CNN Inference with Fine-grained Flexible Parallelism

Cited by: 5
Authors:
Choudhury, Ziaul [1]
Shrivastava, Shashwat [1]
Ramapantulu, Lavanya [1]
Purini, Suresh [1]
Affiliation:
[1] Int Inst Informat Technol Hyderabad, Hyderabad, Telangana, India
Keywords:
FPGAs; convolutional neural networks; accelerators; COMPILING WINDOW OPERATIONS; NEURAL-NETWORKS; ACCELERATOR;
DOI:
10.1145/3519598
CLC number:
TP3 [computing technology, computer technology];
Subject classification code:
0812;
Abstract:
Increasingly, pre-trained convolutional neural networks (CNNs) are being deployed for inference in various computer vision applications, both on the server-side in data centers and at the edge. CNN inference is a very compute-intensive task, and it is a challenge to meet performance metrics such as latency and throughput while optimizing power. Special-purpose ASICs and FPGAs are suitable candidates to meet these power and performance budgets simultaneously. Rapidly evolving CNN architectures involve novel convolution operations such as point convolutions, depth-separable convolutions, and so on. This leads to substantial variation in the computational structure across CNNs and across layers within a CNN. Because of this, FPGA reconfigurability provides an attractive tradeoff compared to ASICs. FPGA-based hardware designers address the structural variability issue by generating a network-specific accelerator for a single network or a class of networks. However, homogeneous accelerators are network agnostic and often sacrifice throughput and FPGA LUTs for flexibility. In this article, we propose an FPGA overlay for efficient processing of CNNs that can be scaled based on the available compute and memory resources of the FPGA. The overlay is configured on the fly through control words sent by the host on a per-layer basis. Unlike current overlays, our architecture exploits all forms of parallelism inside a convolution operation. A constraint system is employed at the host end to determine the per-layer configuration of the overlay that uses all forms of parallelism in the processing of the layer, resulting in the highest throughput for that layer. We studied the effectiveness of our overlay by using it to process the AlexNet, VGG16, YOLO, MobileNet, and ResNet-50 CNNs targeting a Virtex-7 and a larger UltraScale+ VU9P FPGA. The chosen CNNs have a mix of different types of convolution layers and filter sizes, presenting a good variation in model size and structure.
Our accelerator reported a maximum throughput of 1,200 GOps/second on the Virtex-7, an improvement of 1.2x to 5x over recent designs. The reported performance density, measured in giga operations per second per KLUT, is also a 1.3x to 4x improvement over existing works. Similar speed-ups and performance densities are observed for the UltraScale+ VU9P FPGA.
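The host-side constraint system described in the abstract can be pictured as a search over per-layer parallelism factors (input-channel, output-channel, and pixel-level unrolling) subject to the overlay's multiplier budget, choosing the combination that minimizes the cycles needed to process that layer. The following sketch is an illustrative assumption, not the paper's actual constraint formulation; the function name, the exhaustive search, and the simple ceiling-division cost model are all hypothetical.

```python
from itertools import product

def best_config(c_in, c_out, n_pixels, pe_budget=1024, max_factor=64):
    """Hypothetical per-layer configuration search: pick parallelism
    factors (p_in, p_out, p_pix) minimizing the cycles to process a
    layer with c_in input channels, c_out output channels, and
    n_pixels output pixels, under a fixed multiplier budget."""
    best, best_cycles = None, float("inf")
    for p_in, p_out, p_pix in product(range(1, max_factor + 1), repeat=3):
        if p_in * p_out * p_pix > pe_budget:
            continue  # configuration exceeds the overlay's PE budget
        # Ceiling-division cycle count for sweeping the layer with this
        # unrolling (integer ceil via -(-a // b)).
        cycles = (-(-c_in // p_in)) * (-(-c_out // p_out)) * (-(-n_pixels // p_pix))
        if cycles < best_cycles:
            best, best_cycles = (p_in, p_out, p_pix), cycles
    return best, best_cycles

# Example: a 1x1 (point) convolution with 64 input channels, 128 output
# channels, over a 28x28 feature map.
cfg, cycles = best_config(64, 128, 28 * 28)
print(cfg, cycles)
```

In this toy model the lower bound is total MACs divided by the PE budget, and the search attains it whenever the layer dimensions factor cleanly; a real solver would also account for memory bandwidth and the filter-window dimension of parallelism.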
Pages: 26