DFX: A Low-latency Multi-FPGA Appliance for Accelerating Transformer-based Text Generation

Cited by: 22
Authors
Hong, Seongmin [1 ]
Moon, Seungjae [1 ]
Kim, Junsoo [1 ]
Lee, Sungjae [2 ]
Kim, Minsub [2 ]
Lee, Dongsoo [2 ]
Kim, Joo-Young [1 ]
Affiliations
[1] Korea Adv Inst Sci & Technol, Daejeon, South Korea
[2] NAVER CLOVA, Seongnam, South Korea
Source
2022 55TH ANNUAL IEEE/ACM INTERNATIONAL SYMPOSIUM ON MICROARCHITECTURE (MICRO) | 2022
Keywords
Natural Language Processing; GPT; Text Generation; Datacenter; Multi-FPGA Acceleration; Model Parallelism;
DOI
10.1109/MICRO56248.2022.00051
Chinese Library Classification (CLC)
TP3 [Computing Technology, Computer Technology];
Discipline code
0812;
Abstract
Transformer is a deep learning language model widely used for natural language processing (NLP) services in datacenters. Among transformer models, the Generative Pre-trained Transformer (GPT) has achieved remarkable performance in text generation, or natural language generation (NLG), which requires processing a large input context in the summarization stage, followed by a generation stage that produces a single word at a time. Conventional platforms such as GPUs are specialized for the parallel processing of large inputs in the summarization stage, but their performance degrades significantly in the generation stage due to its sequential nature. An efficient hardware platform is therefore needed to address the high latency caused by this sequential characteristic of text generation. In this paper, we present DFX, a multi-FPGA acceleration appliance that executes GPT-2 model inference end-to-end with low latency and high throughput in both the summarization and generation stages. DFX uses model parallelism and an optimized, model- and hardware-aware dataflow for fast simultaneous workload execution across devices. Its compute cores operate on custom instructions and support GPT-2 operations end-to-end. We implement the proposed hardware architecture on four Xilinx Alveo U280 FPGAs and utilize all channels of the high bandwidth memory (HBM) and the maximum number of compute resources for high hardware efficiency. DFX achieves a 5.58x speedup and 3.99x higher energy efficiency over four NVIDIA V100 GPUs on the modern GPT-2 model. DFX is also 8.21x more cost-effective than the GPU appliance, suggesting that it is a promising solution for text-generation workloads in cloud datacenters.
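For context, the two-stage workload the abstract describes can be reproduced with a few lines of host-side code. The sketch below is a minimal illustration in Python using the Hugging Face transformers GPT-2 reference implementation (an assumption for illustration only; it is not DFX's software stack): a single parallel summarization (prefill) pass over the whole prompt, followed by a strictly sequential generation loop that emits one token at a time while reusing the cached keys and values. It is this second loop, dominated by small per-token operations, that causes the GPU latency degradation DFX targets. The model name, greedy decoding, and the 32-token budget are illustrative choices.

```python
# A minimal sketch of the two-stage GPT text-generation workload
# (summarization/prefill, then sequential generation), assuming the
# Hugging Face transformers GPT-2 API. Not DFX's actual software stack.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

prompt = "Multi-FPGA appliances accelerate transformer inference by"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    # Summarization stage: the entire input context is processed in one
    # parallel pass; this is the part GPUs handle well.
    out = model(input_ids, use_cache=True)
    past = out.past_key_values
    next_id = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)

    generated = [next_id]
    # Generation stage: inherently sequential, one token per step. Each
    # step feeds back only the newest token and reuses the cached
    # keys/values, so the hardware sees small matrix-vector workloads --
    # the latency bottleneck DFX is built to address.
    for _ in range(32):
        out = model(next_id, past_key_values=past, use_cache=True)
        past = out.past_key_values
        next_id = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        generated.append(next_id)

print(tokenizer.decode(torch.cat([input_ids] + generated, dim=1)[0]))
```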
Pages: 616-630
Page count: 15
Related papers
50 items total
  • [21] Neural Rule-Execution Tracking Machine For Transformer-Based Text Generation
    Wang, Yufei
    Xu, Can
    Hu, Huang
    Tao, Chongyang
    Wan, Stephen
    Dras, Mark
    Johnson, Mark
    Jiang, Daxin
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 34 (NEURIPS 2021), 2021, 34
  • [22] Transformer-based active learning for multi-class text annotation and classification
    Afzal, Muhammad
    Hussain, Jamil
    Abbas, Asim
    Hussain, Maqbool
    Attique, Muhammad
    Lee, Sungyoung
    DIGITAL HEALTH, 2024, 10
  • [23] Low-Latency Approach for Secure ECG Feature Based Cryptographic Key Generation
    Moosavi, Sanaz Rahimi
    Nigussie, Ethiopia
    Levorato, Marco
    Virtanen, Seppo
    Isoaho, Jouni
    IEEE ACCESS, 2018, 6 : 428 - 442
  • [24] A Low-Latency Syndrome-based Deep Learning Decoder Architecture and its FPGA Implementation
    Kavvousanos, E.
    Paliouras, V.
    2022 11TH INTERNATIONAL CONFERENCE ON MODERN CIRCUITS AND SYSTEMS TECHNOLOGIES (MOCAST), 2022,
  • [25] Resolver-to-Digital Converter with Synchronous Demodulation for FPGA based Low-Latency Control Loops
    Lidozzi, A.
    Sabatini, V.
    Bifaretti, S.
    Brown, G.
    Solero, L.
    Crescimbini, F.
    2017 19TH EUROPEAN CONFERENCE ON POWER ELECTRONICS AND APPLICATIONS (EPE'17 ECCE EUROPE), 2017,
  • [26] Energy-Efficient Low-Latency Signed Multiplier for FPGA-Based Hardware Accelerators
    Ullah, Salim
    Nguyen, Tuan Duy Anh
    Kumar, Akash
    IEEE EMBEDDED SYSTEMS LETTERS, 2021, 13 (02) : 41 - 44
  • [27] Low-Latency In Situ Image Analytics With FPGA-Based Quantized Convolutional Neural Network
    Wang, Maolin
    Lee, Kelvin C. M.
    Chung, Bob M. F.
    Bogaraju, Sharatchandra Varma
    Ng, Ho-Cheung
    Wong, Justin S. J.
    Shum, Ho Cheung
    Tsia, Kevin K.
    So, Hayden Kwok-Hay
    IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2022, 33 (07) : 2853 - 2866
  • [28] Area-Optimized Low-Latency Approximate Multipliers for FPGA-based Hardware Accelerators
    Ullah, Salim
    Rehman, Semeen
    Prabakaran, Bharath Srinivas
    Kriebel, Florian
    Hanif, Muhammad Abdullah
    Shafique, Muhammad
    Kumar, Akash
    2018 55TH ACM/ESDA/IEEE DESIGN AUTOMATION CONFERENCE (DAC), 2018,
  • [29] 4-Gbps low-latency FPGA-based underwater wireless optical communication
    Zhang, Tianyi
    Fei, Chao
    Wang, Yuan
    Du, Ji
    Xie, Yitong
    Zhang, Fei
    Tian, Jiahan
    Zhang, Guowu
    Wang, Gaoxuan
    Hong, Xiaojian
    He, Sailing
    OPTICS EXPRESS, 2024, 32 (21) : 36207 - 36222
  • [30] A Transformer-Based Hierarchical Variational AutoEncoder Combined Hidden Markov Model for Long Text Generation
    Zhao, Kun
    Ding, Hongwei
    Ye, Kai
    Cui, Xiaohui
    ENTROPY, 2021, 23 (10)