Automated parallel execution of distributed task graphs with FPGA clusters

Cited by: 0
Authors
Ruiz, Juan Miguel de Haro [1, 2]
Martinez, Carlos Alvarez [1, 2]
Jimenez-Gonzalez, Daniel [1, 2]
Martorell, Xavier [1, 2]
Ueno, Tomohiro [3]
Sano, Kentaro [3]
Ringlein, Burkhard [4]
Abel, Francois [4]
Weiss, Beat [4]
Affiliations
[1] Barcelona Supercomp Ctr, Barcelona, Spain
[2] Univ Politecn Cataluna, Barcelona, Spain
[3] Riken Ctr Computat Sci, Kobe, Hyogo, Japan
[4] IBM Res Europe, Zurich, Switzerland
Funding
European Union Horizon 2020; Japan Society for the Promotion of Science
Keywords
FPGA; MPI; Task graphs; Heterogeneous computing; High performance computing; Programming models; Distributed computing;
DOI
10.1016/j.future.2024.06.041
Chinese Library Classification number
TP301 [Theory and methods]
Discipline classification code
081202
Abstract
Over the years, Field Programmable Gate Arrays (FPGAs) have been gaining popularity in the High Performance Computing (HPC) field, because their reconfigurability enables very fine-grained optimizations with low energy cost. However, the different characteristics, architectures, and network topologies of the clusters have hindered the use of FPGAs at a large scale. In this work, we present an evolution of OmpSs@FPGA, a high-level task-based programming model and extension to OmpSs-2, that aims at unifying all FPGA clusters by using a message-passing interface that is compatible with FPGA accelerators. These accelerators are programmed with C/C++ pragmas and synthesized with High-Level Synthesis tools. The new framework includes a custom protocol to exchange messages between FPGAs, agnostic of the architecture and network type. On top of that, we present a new communication paradigm called Implicit Message Passing (IMP), where the user does not need to call any message-passing API. Instead, the runtime automatically infers data movement between nodes. We test classic message passing and IMP with three benchmarks on two different FPGA clusters. One is cloudFPGA, a disaggregated platform with AMD FPGAs that are only connected to the network through UDP/TCP/IP. The other is ESSPER, composed of CPU-attached Intel FPGAs that have a private network at the Ethernet level. In both cases, we demonstrate that IMP with OmpSs@FPGA can increase the productivity of FPGA programmers at a large scale by simplifying communication between nodes, without limiting the scalability of applications. We implement the N-body, Heat simulation and Cholesky decomposition benchmarks, and show that FPGA clusters get 2.6x and 2.4x better performance per watt than a CPU-only supercomputer for N-body and Heat, respectively.
Pages: 808-824
Page count: 17