DFX: A Low-latency Multi-FPGA Appliance for Accelerating Transformer-based Text Generation

Cited by: 22
Authors
Hong, Seongmin [1 ]
Moon, Seungjae [1 ]
Kim, Junsoo [1 ]
Lee, Sungjae [2 ]
Kim, Minsub [2 ]
Lee, Dongsoo [2 ]
Kim, Joo-Young [1 ]
Affiliations
[1] Korea Adv Inst Sci & Technol, Daejeon, South Korea
[2] NAVER CLOVA, Seongnam, South Korea
Source
2022 55TH ANNUAL IEEE/ACM INTERNATIONAL SYMPOSIUM ON MICROARCHITECTURE (MICRO) | 2022
Keywords
Natural Language Processing; GPT; Text Generation; Datacenter; Multi-FPGA Acceleration; Model Parallelism;
DOI
10.1109/MICRO56248.2022.00051
CLC Number (Chinese Library Classification)
TP3 [Computing technology and computer technology]
Subject Classification Code
0812
Abstract
Transformer is a deep learning language model widely used for natural language processing (NLP) services in datacenters. Among transformer models, the Generative Pre-trained Transformer (GPT) has achieved remarkable performance in text generation, or natural language generation (NLG), which requires processing a large input context in the summarization stage, followed by the generation stage that produces a single word at a time. Conventional platforms such as GPUs are specialized for the parallel processing of large inputs in the summarization stage, but their performance degrades significantly in the generation stage due to its sequential nature. Therefore, an efficient hardware platform is required to address the high latency caused by the sequential characteristic of text generation. In this paper, we present DFX, a multi-FPGA acceleration appliance that executes GPT-2 model inference end-to-end with low latency and high throughput in both the summarization and generation stages. DFX uses model parallelism and an optimized, model-and-hardware-aware dataflow for fast simultaneous workload execution across devices. Its compute cores operate on custom instructions and support GPT-2 operations end-to-end. We implement the proposed hardware architecture on four Xilinx Alveo U280 FPGAs and utilize all channels of the high-bandwidth memory (HBM) and the maximum number of compute resources for high hardware efficiency. DFX achieves a 5.58x speedup and 3.99x higher energy efficiency over four NVIDIA V100 GPUs on the modern GPT-2 model. DFX is also 8.21x more cost-effective than the GPU appliance, suggesting that it is a promising solution for text generation workloads in cloud datacenters.
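The distinction the abstract draws between the two stages of text generation can be made concrete with a short sketch. The Python snippet below is illustrative only: it runs GPT-2 through the Hugging Face transformers library on a general-purpose host, not on DFX, and the greedy decoding, 32-token budget, and "gpt2" checkpoint are arbitrary assumptions rather than the paper's configuration. It shows why the summarization (prefill) stage parallelizes over the whole input context while the generation stage must emit one token at a time.

    import torch
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

    # Hypothetical prompt; any input context works the same way.
    prompt = "FPGAs can accelerate transformer inference because"
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids

    with torch.no_grad():
        # Summarization (prefill) stage: every input token is processed in one
        # parallel pass, building the per-layer key/value cache.
        out = model(input_ids, use_cache=True)
        past = out.past_key_values
        next_id = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)

        # Generation stage: each new token depends on the previous one, so the
        # loop is inherently sequential -- the latency bottleneck DFX targets.
        generated = [next_id]
        for _ in range(32):
            out = model(next_id, past_key_values=past, use_cache=True)
            past = out.past_key_values
            next_id = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
            generated.append(next_id)

    print(tokenizer.decode(torch.cat([input_ids] + generated, dim=1)[0]))

In the prefill pass the matrix multiplications see the entire prompt at once and map well to massively parallel GPUs; inside the loop each step is a small, latency-bound workload, which is the regime the abstract argues DFX's multi-FPGA appliance is built for.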
Pages: 616-630
Page count: 15