DFX: A Low-latency Multi-FPGA Appliance for Accelerating Transformer-based Text Generation

Cited by: 22
Authors
Hong, Seongmin [1]
Moon, Seungjae [1]
Kim, Junsoo [1]
Lee, Sungjae [2]
Kim, Minsub [2]
Lee, Dongsoo [2]
Kim, Joo-Young [1]
Affiliations
[1] Korea Advanced Institute of Science and Technology (KAIST), Daejeon, South Korea
[2] NAVER CLOVA, Seongnam, South Korea
Keywords
Natural Language Processing; GPT; Text Generation; Datacenter; Multi-FPGA Acceleration; Model Parallelism
DOI
10.1109/MICRO56248.2022.00051
Chinese Library Classification (CLC)
TP3 [Computing Technology, Computer Technology]
Discipline Classification Code
0812
Abstract
The transformer is a deep learning language model widely used for natural language processing (NLP) services in datacenters. Among transformer models, the Generative Pre-trained Transformer (GPT) has achieved remarkable performance in text generation, or natural language generation (NLG), which processes a large input context in the summarization stage and then produces a single word at a time in the generation stage. Conventional platforms such as GPUs are specialized for the parallel processing of large inputs in the summarization stage, but their performance degrades significantly in the generation stage because of its sequential nature. An efficient hardware platform is therefore required to address the high latency caused by the sequential characteristic of text generation. In this paper, we present DFX, a multi-FPGA acceleration appliance that executes GPT-2 model inference end-to-end with low latency and high throughput in both the summarization and generation stages. DFX uses model parallelism and an optimized, model-and-hardware-aware dataflow for fast simultaneous workload execution across devices. Its compute cores operate on custom instructions and support GPT-2 operations end-to-end. We implement the proposed hardware architecture on four Xilinx Alveo U280 FPGAs, utilizing all channels of the high-bandwidth memory (HBM) and the maximum number of compute resources for high hardware efficiency. On the modern GPT-2 model, DFX achieves a 5.58x speedup and 3.99x higher energy efficiency than four NVIDIA V100 GPUs. DFX is also 8.21x more cost-effective than the GPU appliance, suggesting that it is a promising solution for text-generation workloads in cloud datacenters.
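The two-stage workload the abstract describes maps directly onto the prefill/decode split of autoregressive transformer inference. The sketch below illustrates it with the Hugging Face transformers GPT-2 implementation, chosen here purely for illustration; DFX itself executes these stages on custom FPGA compute cores, not this software stack. The summarization stage consumes the entire context in one parallel pass, while the generation stage emits one token per step, each step depending on the previous token.

```python
# A minimal sketch of the summarization (prefill) and generation (decode)
# stages of GPT text generation, using the Hugging Face `transformers`
# GPT-2 implementation. Illustration of the workload only; DFX runs these
# stages on custom FPGA compute cores, not this API.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

context = "Transformer models dominate natural language processing because"
input_ids = tokenizer(context, return_tensors="pt").input_ids

with torch.no_grad():
    # Summarization stage: the whole input context is processed in one
    # large, highly parallel pass -- the regime GPUs are optimized for.
    out = model(input_ids, use_cache=True)
    past = out.past_key_values
    next_id = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)

    # Generation stage: tokens are produced strictly one at a time, and
    # each step depends on the token produced by the previous step. This
    # sequential, latency-bound loop is the bottleneck DFX targets.
    generated = [next_id]
    for _ in range(31):
        out = model(next_id, past_key_values=past, use_cache=True)
        past = out.past_key_values
        next_id = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        generated.append(next_id)

print(context + tokenizer.decode(torch.cat(generated, dim=1)[0]))
```

Because each decode step is a small matrix-vector workload that cannot be parallelized across time, per-step latency rather than peak throughput governs end-to-end response time, which is why the paper argues conventional GPU platforms are a poor fit for the generation stage.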
Pages: 616-630
Page count: 15