Automated parallel execution of distributed task graphs with FPGA clusters

Cited by: 0

Authors:

Ruiz, Juan Miguel de Haro ^{[1,2]}

Martinez, Carlos Alvarez ^{[1,2]}

Jimenez-Gonzalez, Daniel ^{[1,2]}

Martorell, Xavier ^{[1,2]}

Ueno, Tomohiro ^{[3]}

Sano, Kentaro ^{[3]}

Ringlein, Burkhard ^{[4]}

Abel, Francois ^{[4]}

Weiss, Beat ^{[4]}

Affiliations:

[1] Barcelona Supercomp Ctr, Barcelona, Spain

[2] Univ Politecn Cataluna, Barcelona, Spain

[3] Riken Ctr Computat Sci, Kobe, Hyogo, Japan

[4] IBM Res Europe, Zurich, Switzerland

Source:

FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF ESCIENCE | 2024, Vol. 160

Funding:

European Union Horizon 2020; Japan Society for the Promotion of Science;

Keywords:

FPGA; MPI; Task graphs; Heterogeneous computing; High performance computing; Programming models; Distributed computing;

DOI:

10.1016/j.future.2024.06.041

Chinese Library Classification number:

TP301 [Theory, methods];

Discipline classification code:

081202 ;

Abstract:

Over the years, Field Programmable Gate Arrays (FPGAs) have been gaining popularity in the High Performance Computing (HPC) field, because their reconfigurability enables very fine-grained optimizations with low energy cost. However, the different characteristics, architectures, and network topologies of the clusters have hindered the use of FPGAs at a large scale. In this work, we present an evolution of OmpSs@FPGA, a high-level task-based programming model and extension to OmpSs-2, that aims at unifying all FPGA clusters by using a message-passing interface that is compatible with FPGA accelerators. These accelerators are programmed with C/C++ pragmas, and synthesized with High-Level Synthesis tools. The new framework includes a custom protocol to exchange messages between FPGAs, agnostic of the architecture and network type. On top of that, we present a new communication paradigm called Implicit Message Passing (IMP), where the user does not need to call any message-passing API. Instead, the runtime automatically infers data movement between nodes. We test classic message passing and IMP with three benchmarks on two different FPGA clusters. One is cloudFPGA, a disaggregated platform with AMD FPGAs that are only connected to the network through UDP/TCP/IP. The other is ESSPER, composed of CPU-attached Intel FPGAs that have a private network at the Ethernet level. In both cases, we demonstrate that IMP with OmpSs@FPGA can increase the productivity of FPGA programmers at a large scale by simplifying communication between nodes, without limiting the scalability of applications. We implement the N-body, Heat simulation and Cholesky decomposition benchmarks, and show that FPGA clusters get 2.6x and 2.4x better performance per watt than a CPU-only supercomputer for N-body and Heat, respectively.
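
The idea behind Implicit Message Passing, a runtime that infers data movement from the task graph instead of requiring explicit send/receive calls, can be illustrated with a small sketch. This is not the OmpSs@FPGA API or the authors' algorithm: the `Task` class, the `infer_transfers` function and the node numbering are hypothetical, and the sketch assumes each task is already assigned to a node and declares the data it reads and writes.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    """Hypothetical task record: where it runs and which data it touches."""
    name: str
    node: int                                   # node (e.g. an FPGA) assigned to the task
    reads: list = field(default_factory=list)   # names of input data
    writes: list = field(default_factory=list)  # names of output data


def infer_transfers(tasks):
    """Derive the messages implied by the declared dependencies.

    Tasks are visited in creation order. Whenever a task reads data whose
    latest version was produced on a different node, a transfer
    (src_node, dst_node, datum, consumer_task) is recorded.
    """
    last_writer = {}   # datum -> node that holds the latest version
    transfers = []
    for task in tasks:
        for datum in task.reads:
            src = last_writer.get(datum)
            if src is not None and src != task.node:
                transfers.append((src, task.node, datum, task.name))
        for datum in task.writes:
            last_writer[datum] = task.node
    return transfers


tasks = [
    Task("init_a", node=0, writes=["a"]),
    Task("init_b", node=1, writes=["b"]),
    Task("sum", node=1, reads=["a", "b"], writes=["c"]),
    Task("report", node=0, reads=["c"]),
]

transfers = infer_transfers(tasks)
for src, dst, datum, consumer in transfers:
    print(f"node {src} -> node {dst}: send {datum!r} for task {consumer}")
```

In this toy graph the programmer never writes a send or receive; the two required messages fall out of the read/write declarations. The sketch deliberately ignores details a real runtime would need, such as caching data that has already been copied to a node and overlapping transfers with computation.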


Pages: 808-824

Page count: 17

