An FPGA Overlay for CNN Inference with Fine-grained Flexible Parallelism

Cited by: 5
Authors:
Choudhury, Ziaul [1]
Shrivastava, Shashwat [1]
Ramapantulu, Lavanya [1]
Purini, Suresh [1]
Affiliations:
[1] International Institute of Information Technology, Hyderabad, Telangana, India
Keywords:
FPGAs; convolutional neural networks; accelerators; COMPILING WINDOW OPERATIONS; NEURAL-NETWORKS; ACCELERATOR;
DOI:
10.1145/3519598
CLC number:
TP3 [computing technology; computer technology]
Subject classification code:
0812
Abstract:
Increasingly, pre-trained convolutional neural networks (CNNs) are being deployed for inference in various computer vision applications, both on the server side in data centers and at the edge. CNN inference is a very compute-intensive task, and it is a challenge to meet performance metrics such as latency and throughput while optimizing power. Special-purpose ASICs and FPGAs are suitable candidates for meeting these power and performance budgets simultaneously. Rapidly evolving CNN architectures involve novel convolution operations such as pointwise convolutions, depthwise separable convolutions, and so on. This leads to substantial variation in computational structure across CNNs and across layers within a CNN. Because of this, FPGA reconfigurability offers an attractive tradeoff compared to ASICs. FPGA-based hardware designers address the structural variability issue by generating a network-specific accelerator for a single network or a class of networks. However, homogeneous accelerators are network-agnostic and often sacrifice throughput and FPGA LUTs for flexibility. In this article, we propose an FPGA overlay for efficient processing of CNNs that can be scaled based on the available compute and memory resources of the FPGA. The overlay is configured on the fly through control words sent by the host on a per-layer basis. Unlike current overlays, our architecture exploits all forms of parallelism inside a convolution operation. A constraint system is employed at the host end to determine the per-layer configuration of the overlay that uses all forms of parallelism in processing the layer, yielding the highest throughput for that layer. We studied the effectiveness of our overlay by using it to process the AlexNet, VGG16, YOLO, MobileNet, and ResNet-50 CNNs, targeting a Virtex-7 FPGA and a larger UltraScale+ VU9P FPGA. The chosen CNNs have a mix of different convolution layer types and filter sizes, presenting a good variation in model size and structure.
Our accelerator reported a maximum throughput of 1,200 GOps/second on the Virtex-7, an improvement of 1.2x to 5x over recent designs. The reported performance density, measured in giga operations per second per KLUT, is a 1.3x to 4x improvement over existing works. Similar speed-ups and performance densities are also observed on the UltraScale+ VU9P FPGA.
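The per-layer constraint search the abstract describes can be illustrated with a minimal sketch. The paper does not give its formulation here, so the following is a hypothetical simplification: it exhaustively searches unroll factors for input channels, output channels, and the filter window under a DSP (multiplier) budget, and scores each feasible configuration by its effective MAC rate. All names and parameters (`best_layer_config`, the factor set, the DSP budget) are illustrative, not taken from the paper.

```python
import math
from itertools import product

def best_layer_config(in_ch, out_ch, k, num_dsps,
                      factors=(1, 2, 4, 8, 16, 32, 64)):
    """Pick unroll factors for one convolution layer.

    p_ic / p_oc / p_k are the input-channel, output-channel, and
    filter-window parallelism degrees. A configuration is feasible if
    its multiplier demand fits the DSP budget; among feasible ones we
    keep the highest effective MAC rate, which penalizes factors that
    do not evenly divide the layer dimensions (padding waste).
    """
    total_macs = in_ch * out_ch * k * k  # MACs per output pixel
    best = None
    for p_ic, p_oc, p_k in product(factors, repeat=3):
        if p_ic * p_oc * p_k > num_dsps:
            continue  # infeasible: exceeds the multiplier budget
        # cycles to cover the layer's loop nest at this unrolling
        cycles = (math.ceil(in_ch / p_ic) * math.ceil(out_ch / p_oc)
                  * math.ceil(k * k / p_k))
        rate = total_macs / cycles  # effective MACs per cycle
        if best is None or rate > best[0]:
            best = (rate, (p_ic, p_oc, p_k))
    return best

# e.g. a 3x3 layer, 64 input / 128 output channels, on a 512-DSP budget
rate, cfg = best_layer_config(64, 128, 3, 512)
```

Because the search considers the three parallelism dimensions jointly, a layer whose shape wastes multipliers along one dimension (e.g. a 3x3 window that pads poorly to a power of two) can recover utilization along another, which is the intuition behind choosing a fresh configuration per layer.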
Pages: 26