Loop Parallelization Techniques for FPGA Accelerator Synthesis

Times cited: 3
Authors
Reiche, Oliver [1]
Ozkan, M. Akif [1]
Hannig, Frank [1]
Teich, Juergen [1]
Schmid, Moritz [2]
Affiliations
[1] Friedrich Alexander Univ, Hardware Software Codesign, Dept Comp Sci, Erlangen Nurnberg FAU, Cauerstr 11, D-91054 Erlangen, Germany
[2] Siemens Healthcare GmbH, Adv Therapies Business Unit, R&D, Forchheim, Germany
Keywords
Altera OpenCL; Vivado HLS; Vectorization; Loop coarsening; Loop tiling; FLOW
DOI
10.1007/s11265-017-1229-7
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Current tools for High-Level Synthesis (HLS) excel at exploiting Instruction-Level Parallelism (ILP). In contrast, their support for Data-Level Parallelism (DLP), one of the key advantages of Field-Programmable Gate Arrays (FPGAs), is very limited. This work examines the exploitation of DLP on FPGAs using code generation for C-based HLS of image filters and streaming pipelines. In addition to well-known loop tiling techniques, we propose loop coarsening, which delivers superior performance and scalability. Loop tiling corresponds to splitting an image into separate regions, which are then processed in parallel by replicated accelerators. For data streaming, this also requires the generation of glue logic for the distribution of image data. Conversely, loop coarsening allows processing multiple pixels in parallel, whereby only the kernel operator is replicated within a single accelerator. We present concrete implementations of tiling and coarsening for Vivado HLS and Altera OpenCL. Furthermore, we compare our implementations to the keyword-driven parallelization support provided by the Altera Offline Compiler. We augment the FPGA back end of the heterogeneous Domain-Specific Language (DSL) framework HIPAcc to generate loop coarsening implementations for Vivado HLS and Altera OpenCL. Moreover, we compare the resulting FPGA accelerators to highly optimized software implementations for Graphics Processing Units (GPUs), all generated from exactly the same code base.
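To make the difference between tiling and coarsening concrete, here is a minimal behavioral model in Python. It is a sketch under stated assumptions, not HIPAcc output and not HLS code: the 1x3 horizontal box filter `kernel_op`, the row-stripe split in `tiled`, and the parameters `tiles` and `factor` are hypothetical choices made for illustration. In `tiled`, each row stripe stands in for one replicated accelerator; in `coarsened`, each outer iteration emits `factor` adjacent pixels, so only the kernel operator is replicated.

```python
# Behavioral sketch only: loop tiling vs. loop coarsening for a simple image filter.
# kernel_op, tiles, and factor are hypothetical illustration choices.


def kernel_op(img, y, x):
    """Hypothetical kernel operator: 1x3 horizontal box sum with clamped borders."""
    w = len(img[0])
    left = img[y][max(x - 1, 0)]
    right = img[y][min(x + 1, w - 1)]
    return left + img[y][x] + right


def reference(img):
    """Plain nested loop: one pixel per iteration."""
    h, w = len(img), len(img[0])
    return [[kernel_op(img, y, x) for x in range(w)] for y in range(h)]


def tiled(img, tiles):
    """Loop tiling: split the rows into `tiles` stripes.

    In hardware, each stripe would be handled by its own replicated accelerator,
    running concurrently; here the stripes are simply processed one after another.
    """
    if tiles < 1:
        raise ValueError("tiles must be >= 1")
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    rows_per_tile = (h + tiles - 1) // tiles  # ceiling division
    for t in range(tiles):  # one hypothetical accelerator instance per t
        y0 = t * rows_per_tile
        y1 = min(y0 + rows_per_tile, h)
        for y in range(y0, y1):
            for x in range(w):
                out[y][x] = kernel_op(img, y, x)
    return out


def coarsened(img, factor):
    """Loop coarsening: each outer iteration produces `factor` adjacent pixels.

    In hardware, the inner loop would be fully unrolled, giving `factor` copies
    of kernel_op inside a single accelerator.
    """
    h, w = len(img), len(img[0])
    if w % factor != 0:
        raise ValueError("image width must be a multiple of factor")
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(0, w, factor):  # one iteration = one pipeline step
            for i in range(factor):    # unrolled: replicated kernel operator
                out[y][x + i] = kernel_op(img, y, x + i)
    return out


if __name__ == "__main__":
    img = [[x + 10 * y for x in range(8)] for y in range(4)]
    ref = reference(img)
    print(tiled(img, tiles=2) == ref, coarsened(img, factor=4) == ref)  # True True
```

Because this hypothetical filter window is purely horizontal, the stripes in `tiled` need no overlapping border rows. A 2D window would require each stripe to also receive halo rows from its neighbors, which is one reason tiling a streaming pipeline needs extra glue logic to distribute image data, whereas coarsening keeps a single input stream.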
Pages: 3–27
Number of pages: 25