Automated parallel execution of distributed task graphs with FPGA clusters

Cited by: 0

Authors:

Ruiz, Juan Miguel de Haro [1,2]

Martinez, Carlos Alvarez [1,2]

Jimenez-Gonzalez, Daniel [1,2]

Martorell, Xavier [1,2]

Ueno, Tomohiro [3]

Sano, Kentaro [3]

Ringlein, Burkhard [4]

Abel, Francois [4]

Weiss, Beat [4]

Affiliations:

[1] Barcelona Supercomp Ctr, Barcelona, Spain

[2] Univ Politecn Cataluna, Barcelona, Spain

[3] Riken Ctr Computat Sci, Kobe, Hyogo, Japan

[4] IBM Res Europe, Zurich, Switzerland

Source:

FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF ESCIENCE | 2024 / Vol. 160

Funding:

European Union Horizon 2020; Japan Society for the Promotion of Science;

Keywords:

FPGA; MPI; Task graphs; Heterogeneous computing; High performance computing; Programming models; Distributed computing;

DOI:

10.1016/j.future.2024.06.041

Chinese Library Classification (CLC) number:

TP301 [Theory, methods];

Discipline classification code:

081202;

Abstract:

Over the years, Field Programmable Gate Arrays (FPGAs) have been gaining popularity in the High Performance Computing (HPC) field, because their reconfigurability enables very fine-grained optimizations with low energy cost. However, the different characteristics, architectures, and network topologies of the clusters have hindered the use of FPGAs at a large scale. In this work, we present an evolution of OmpSs@FPGA, a high-level task-based programming model and extension to OmpSs-2, that aims at unifying all FPGA clusters by using a message-passing interface that is compatible with FPGA accelerators. These accelerators are programmed with C/C++ pragmas and synthesized with High-Level Synthesis tools. The new framework includes a custom protocol to exchange messages between FPGAs, agnostic of the architecture and network type. On top of that, we present a new communication paradigm called Implicit Message Passing (IMP), where the user does not need to call any message-passing API. Instead, the runtime automatically infers data movement between nodes. We test classic message passing and IMP with three benchmarks on two different FPGA clusters. One is cloudFPGA, a disaggregated platform with AMD FPGAs that are only connected to the network through UDP/TCP/IP. The other is ESSPER, composed of CPU-attached Intel FPGAs that have a private network at the Ethernet level. In both cases, we demonstrate that IMP with OmpSs@FPGA can increase the productivity of FPGA programmers at a large scale by simplifying communication between nodes, without limiting the scalability of applications. We implement the N-body, Heat simulation and Cholesky decomposition benchmarks, and show that FPGA clusters get 2.6x and 2.4x better performance per watt than a CPU-only supercomputer for N-body and Heat.
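
The idea behind Implicit Message Passing, a runtime inferring data movement from task dependencies instead of the user calling a message-passing API, can be illustrated with a toy model. The sketch below is not the OmpSs@FPGA runtime or its protocol; the `Task` class, the `infer_transfers` function, and the block and node names are all hypothetical and chosen only for illustration. It assumes each task is pinned to a node and declares the data blocks it reads and writes, then derives which node-to-node messages the task graph implies.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    """Hypothetical task descriptor: where it runs and which blocks it touches."""
    name: str
    node: int                                   # node (e.g. one FPGA) running the task
    reads: list = field(default_factory=list)   # data blocks consumed
    writes: list = field(default_factory=list)  # data blocks produced


def infer_transfers(tasks, initial_owner):
    """Return the (src_node, dst_node, block, consumer_task) messages implied
    by the declared dependencies, visiting tasks in program order."""
    owner = dict(initial_owner)  # block -> node holding the latest version
    cached = set()               # (block, node) pairs with an up-to-date copy
    transfers = []
    for task in tasks:
        for block in task.reads:
            src = owner[block]
            if src != task.node and (block, task.node) not in cached:
                # The latest version lives elsewhere: a message is needed.
                transfers.append((src, task.node, block, task.name))
                cached.add((block, task.node))
        for block in task.writes:
            # The writer now owns the block; older remote copies become stale.
            owner[block] = task.node
            cached = {(b, n) for (b, n) in cached if b != block}
    return transfers


if __name__ == "__main__":
    example_tasks = [
        Task("init_a", node=0, writes=["a"]),
        Task("init_b", node=1, writes=["b"]),
        Task("sum", node=0, reads=["a", "b"], writes=["c"]),
        Task("scale", node=1, reads=["c"], writes=["c"]),
        Task("check", node=1, reads=["c", "b"]),
    ]
    for message in infer_transfers(example_tasks, {"a": 0, "b": 1, "c": 0}):
        print(message)
```

In this toy run the user never writes a send or receive: the two messages (block `b` to node 0 for `sum`, block `c` to node 1 for `scale`) follow entirely from the declared reads and writes, which is the productivity argument the abstract makes for IMP.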


Pages: 808-824

Page count: 17
