Distributed Recommendation Inference on FPGA Clusters

Cited by: 9

Authors:

Zhu, Yu [1]

He, Zhenhao [1]

Jiang, Wenqi [1]

Zeng, Kai [2]

Zhou, Jingren [2]

Alonso, Gustavo [1]

Affiliations:

[1] Swiss Federal Institute of Technology (ETH Zurich), Systems Group, Zurich, Switzerland

[2] Alibaba Group, Hangzhou, People's Republic of China

Source:

2021 31st International Conference on Field-Programmable Logic and Applications (FPL 2021) | 2021

DOI:

10.1109/FPL53798.2021.00057

Chinese Library Classification (CLC):

TP3 [Computing technology, computer technology]

Discipline classification code:

0812

Abstract:

Deep neural networks are widely used in personalized recommendation systems. Such models involve two major components: the memory-bound embedding layer and the computation-bound fully-connected layers. Existing solutions are either slow on both stages or optimize only one of them. To implement recommendation inference efficiently in the context of a real deployment, we design and implement an FPGA cluster that optimizes the performance of both stages. To remove the memory bottleneck, we take advantage of the High-Bandwidth Memory (HBM) available on the latest FPGAs for highly concurrent embedding table lookups. To match the required DNN computation throughput, we partition the workload across multiple FPGAs interconnected via a 100 Gbps TCP/IP network. Compared to an optimized CPU baseline (16 vCPU, AVX2-enabled) and a one-node FPGA implementation, our system (four-node version) achieves 28.95x and 7.68x speedups in throughput, respectively. The proposed system also guarantees a latency of tens of microseconds per single inference, significantly better than CPU- and GPU-based systems, which take at least milliseconds.
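The two stages named in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' FPGA implementation: the table count, table sizes, embedding dimension, layer widths, and all function names are hypothetical, chosen only to show why the first stage is dominated by memory gathers and the second by multiply-accumulate work.

```python
import random


def make_table(rows, dim, rng):
    """Build one embedding table with random values (hypothetical sizes)."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(rows)]


def embedding_lookup(tables, indices):
    """Memory-bound stage: gather one row per table and concatenate them."""
    features = []
    for table, idx in zip(tables, indices):
        features.extend(table[idx])
    return features


def dense_layer(x, weights, bias):
    """Computation-bound stage: matrix-vector product followed by ReLU."""
    return [
        max(0.0, sum(w * v for w, v in zip(row, x)) + b)
        for row, b in zip(weights, bias)
    ]


def infer(tables, layers, indices):
    """Run one inference: embedding gathers, then fully-connected layers."""
    x = embedding_lookup(tables, indices)
    for weights, bias in layers:
        x = dense_layer(x, weights, bias)
    return x


rng = random.Random(0)

# Assumed toy model: 3 tables of 100 rows, embedding dimension 4.
tables = [make_table(rows=100, dim=4, rng=rng) for _ in range(3)]

# Assumed layer widths: 12 concatenated features -> 8 -> 1.
layer_sizes = [12, 8, 1]
layers = []
for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]
    bias = [0.0] * n_out
    layers.append((weights, bias))

score = infer(tables, layers, indices=[5, 42, 99])
```

In this sketch each inference touches only one row per table but every weight in every layer, which is the asymmetry the abstract describes between the lookup stage and the fully-connected stage.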

Pages: 279-285

Page count: 7
