Automated parallel execution of distributed task graphs with FPGA clusters

Authors
Ruiz, Juan Miguel de Haro [1, 2]
Martinez, Carlos Alvarez [1, 2]
Jimenez-Gonzalez, Daniel [1, 2]
Martorell, Xavier [1, 2]
Ueno, Tomohiro [3]
Sano, Kentaro [3]
Ringlein, Burkhard [4]
Abel, Francois [4]
Weiss, Beat [4]
Affiliations
[1] Barcelona Supercomp Ctr, Barcelona, Spain
[2] Univ Politecn Cataluna, Barcelona, Spain
[3] Riken Ctr Computat Sci, Kobe, Hyogo, Japan
[4] IBM Res Europe, Zurich, Switzerland
Funding
European Union Horizon 2020; Japan Society for the Promotion of Science (JSPS)
Keywords
FPGA; MPI; Task graphs; Heterogeneous computing; High performance computing; Programming models; Distributed computing;
DOI
10.1016/j.future.2024.06.041
Chinese Library Classification number
TP301 [Theory, methods]
Discipline classification code
081202
Abstract
Over the years, Field Programmable Gate Arrays (FPGAs) have been gaining popularity in the High Performance Computing (HPC) field, because their reconfigurability enables very fine-grained optimizations at low energy cost. However, the differing characteristics, architectures, and network topologies of FPGA clusters have hindered the use of FPGAs at large scale. In this work, we present an evolution of OmpSs@FPGA, a high-level task-based programming model and extension to OmpSs-2, that aims at unifying all FPGA clusters by using a message-passing interface that is compatible with FPGA accelerators. These accelerators are programmed with C/C++ pragmas and synthesized with High-Level Synthesis tools. The new framework includes a custom protocol to exchange messages between FPGAs that is agnostic of the architecture and network type. On top of that, we present a new communication paradigm called Implicit Message Passing (IMP), in which the user does not need to call any message-passing API. Instead, the runtime automatically infers data movement between nodes. We test classic message passing and IMP with three benchmarks on two different FPGA clusters. One is cloudFPGA, a disaggregated platform with AMD FPGAs that are connected to the network only through UDP/TCP/IP. The other is ESSPER, composed of CPU-attached Intel FPGAs that have a private network at the Ethernet level. In both cases, we demonstrate that IMP with OmpSs@FPGA can increase the productivity of FPGA programmers at large scale by simplifying communication between nodes, without limiting the scalability of applications. We implement the N-body, Heat simulation, and Cholesky decomposition benchmarks, and show that FPGA clusters achieve 2.6x and 2.4x better performance per watt than a CPU-only supercomputer for N-body and Heat, respectively.
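The idea behind Implicit Message Passing, a runtime that infers data movement from task dependencies instead of relying on explicit send/receive calls, can be illustrated with a small toy model. The sketch below is not the OmpSs@FPGA runtime or its API. The names `Task` and `infer_transfers` and the block-ownership bookkeeping are assumptions made for illustration only. It walks tasks in program order and emits a transfer whenever a task reads a data block whose latest version lives on another node.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    """Hypothetical task record: where it runs and which blocks it touches."""
    name: str
    node: int                                   # node (e.g. an FPGA) running the task
    reads: list = field(default_factory=list)   # input data blocks
    writes: list = field(default_factory=list)  # output data blocks


def infer_transfers(tasks, initial_owner):
    """Return the (block, src_node, dst_node) messages implied by the tasks.

    Illustrative assumption: tasks are visited in program order, and every
    block that is read already has an entry in initial_owner or was written
    by an earlier task.
    """
    owner = dict(initial_owner)                 # block -> node with latest version
    valid = {b: {n} for b, n in owner.items()}  # block -> nodes with a fresh copy
    transfers = []
    for task in tasks:
        for block in task.reads:
            if task.node not in valid[block]:
                # The reader has no fresh copy: a message is needed.
                transfers.append((block, owner[block], task.node))
                valid[block].add(task.node)
        for block in task.writes:
            # A write invalidates every copy except the writer's.
            owner[block] = task.node
            valid[block] = {task.node}
    return transfers


# Two nodes each update their own block but read the neighbour's block.
tasks = [
    Task("update_A", node=0, reads=["A", "B"], writes=["A"]),
    Task("update_B", node=1, reads=["A", "B"], writes=["B"]),
]
messages = infer_transfers(tasks, {"A": 0, "B": 1})
for block, src, dst in messages:
    print(f"send block {block}: node {src} -> node {dst}")
```

In this toy model the programmer only declares what each task reads and writes. The two messages are derived from those declarations, which is the productivity argument the abstract makes for IMP.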
Pages: 808-824
Page count: 17