DFX: A Low-latency Multi-FPGA Appliance for Accelerating Transformer-based Text Generation

Cited by: 22
Authors
Hong, Seongmin [1 ]
Moon, Seungjae [1 ]
Kim, Junsoo [1 ]
Lee, Sungjae [2 ]
Kim, Minsub [2 ]
Lee, Dongsoo [2 ]
Kim, Joo-Young [1 ]
Affiliations
[1] Korea Adv Inst Sci & Technol, Daejeon, South Korea
[2] NAVER CLOVA, Seongnam, South Korea
Source
2022 55TH ANNUAL IEEE/ACM INTERNATIONAL SYMPOSIUM ON MICROARCHITECTURE (MICRO) | 2022
Keywords
Natural Language Processing; GPT; Text Generation; Datacenter; Multi-FPGA Acceleration; Model Parallelism;
DOI
10.1109/MICRO56248.2022.00051
Chinese Library Classification (CLC)
TP3 [Computing Technology, Computer Technology];
Discipline code
0812;
Abstract
Transformer is a deep learning language model widely used for natural language processing (NLP) services in datacenters. Among transformer models, the Generative Pre-trained Transformer (GPT) has achieved remarkable performance in text generation, or natural language generation (NLG), which requires processing a large input context in the summarization stage, followed by a generation stage that produces a single word at a time. Conventional platforms such as GPUs are specialized for the parallel processing of large inputs in the summarization stage, but their performance degrades significantly in the generation stage due to its sequential nature. An efficient hardware platform is therefore needed to address the high latency caused by this sequential characteristic of text generation. In this paper, we present DFX, a multi-FPGA acceleration appliance that executes GPT-2 model inference end-to-end with low latency and high throughput in both the summarization and generation stages. DFX uses model parallelism and an optimized, model- and hardware-aware dataflow for fast simultaneous workload execution across devices. Its compute cores operate on custom instructions and support GPT-2 operations end-to-end. We implement the proposed hardware architecture on four Xilinx Alveo U280 FPGAs and utilize all channels of the high bandwidth memory (HBM) and the maximum number of compute resources for high hardware efficiency. DFX achieves a 5.58x speedup and 3.99x higher energy efficiency over four NVIDIA V100 GPUs on the modern GPT-2 model. DFX is also 8.21x more cost-effective than the GPU appliance, suggesting that it is a promising solution for text-generation workloads in cloud datacenters.
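For context, the two-stage workload the abstract describes can be reproduced with a few lines of host-side code. The sketch below is a minimal illustration in Python using the Hugging Face transformers GPT-2 reference implementation (an assumption for illustration only; it is not DFX's software stack): a single parallel summarization (prefill) pass over the whole prompt, followed by a strictly sequential generation loop that emits one token at a time while reusing the cached keys and values. It is this second loop, dominated by small per-token operations, that causes the GPU latency degradation DFX targets. The model name, greedy decoding, and the 32-token budget are illustrative choices.

```python
# A minimal sketch of the two-stage GPT text-generation workload
# (summarization/prefill, then sequential generation), assuming the
# Hugging Face transformers GPT-2 API. Not DFX's actual software stack.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

prompt = "Multi-FPGA appliances accelerate transformer inference by"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    # Summarization stage: the entire input context is processed in one
    # parallel pass; this is the part GPUs handle well.
    out = model(input_ids, use_cache=True)
    past = out.past_key_values
    next_id = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)

    generated = [next_id]
    # Generation stage: inherently sequential, one token per step. Each
    # step feeds back only the newest token and reuses the cached
    # keys/values, so the hardware sees small matrix-vector workloads --
    # the latency bottleneck DFX is built to address.
    for _ in range(32):
        out = model(next_id, past_key_values=past, use_cache=True)
        past = out.past_key_values
        next_id = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        generated.append(next_id)

print(tokenizer.decode(torch.cat([input_ids] + generated, dim=1)[0]))
```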
Pages: 616-630
Page count: 15
Related papers
50 items total
  • [21] Neural Rule-Execution Tracking Machine For Transformer-Based Text Generation
    Wang, Yufei
    Xu, Can
    Hu, Huang
    Tao, Chongyang
    Wan, Stephen
    Dras, Mark
    Johnson, Mark
    Jiang, Daxin
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 34 (NEURIPS 2021), 2021, 34
  • [22] Transformer-based active learning for multi-class text annotation and classification
    Afzal, Muhammad
    Hussain, Jamil
    Abbas, Asim
    Hussain, Maqbool
    Attique, Muhammad
    Lee, Sungyoung
    DIGITAL HEALTH, 2024, 10
  • [23] Low-Latency Approach for Secure ECG Feature Based Cryptographic Key Generation
    Moosavi, Sanaz Rahimi
    Nigussie, Ethiopia
    Levorato, Marco
    Virtanen, Seppo
    Isoaho, Jouni
    IEEE ACCESS, 2018, 6 : 428 - 442
  • [24] A Low-Latency Syndrome-based Deep Learning Decoder Architecture and its FPGA Implementation
    Kavvousanos, E.
    Paliouras, V.
    2022 11TH INTERNATIONAL CONFERENCE ON MODERN CIRCUITS AND SYSTEMS TECHNOLOGIES (MOCAST), 2022,
  • [25] Resolver-to-Digital Converter with Synchronous Demodulation for FPGA based Low-Latency Control Loops
    Lidozzi, A.
    Sabatini, V.
    Bifaretti, S.
    Brown, G.
    Solero, L.
    Crescimbini, F.
    2017 19TH EUROPEAN CONFERENCE ON POWER ELECTRONICS AND APPLICATIONS (EPE'17 ECCE EUROPE), 2017,
  • [26] Energy-Efficient Low-Latency Signed Multiplier for FPGA-Based Hardware Accelerators
    Ullah, Salim
    Nguyen, Tuan Duy Anh
    Kumar, Akash
    IEEE EMBEDDED SYSTEMS LETTERS, 2021, 13 (02) : 41 - 44
  • [27] Low-Latency In Situ Image Analytics With FPGA-Based Quantized Convolutional Neural Network
    Wang, Maolin
    Lee, Kelvin C. M.
    Chung, Bob M. F.
    Bogaraju, Sharatchandra Varma
    Ng, Ho-Cheung
    Wong, Justin S. J.
    Shum, Ho Cheung
    Tsia, Kevin K.
    So, Hayden Kwok-Hay
    IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2022, 33 (07) : 2853 - 2866
  • [28] Area-Optimized Low-Latency Approximate Multipliers for FPGA-based Hardware Accelerators
    Ullah, Salim
    Rehman, Semeen
    Prabakaran, Bharath Srinivas
    Kriebel, Florian
    Hanif, Muhammad Abdullah
    Shafique, Muhammad
    Kumar, Akash
    2018 55TH ACM/ESDA/IEEE DESIGN AUTOMATION CONFERENCE (DAC), 2018,
  • [29] 4-Gbps low-latency FPGA-based underwater wireless optical communication
    Zhang, Tianyi
    Fei, Chao
    Wang, Yuan
    Du, Ji
    Xie, Yitong
    Zhang, Fei
    Tian, Jiahan
    Zhang, Guowu
    Wang, Gaoxuan
    Hong, Xiaojian
    He, Sailing
    OPTICS EXPRESS, 2024, 32 (21) : 36207 - 36222
  • [30] A Transformer-Based Hierarchical Variational AutoEncoder Combined Hidden Markov Model for Long Text Generation
    Zhao, Kun
    Ding, Hongwei
    Ye, Kai
    Cui, Xiaohui
    ENTROPY, 2021, 23 (10)