DFX: A Low-latency Multi-FPGA Appliance for Accelerating Transformer-based Text Generation

Cited by: 22
Authors
Hong, Seongmin [1 ]
Moon, Seungjae [1 ]
Kim, Junsoo [1 ]
Lee, Sungjae [2 ]
Kim, Minsub [2 ]
Lee, Dongsoo [2 ]
Kim, Joo-Young [1 ]
Affiliations
[1] Korea Adv Inst Sci & Technol, Daejeon, South Korea
[2] NAVER CLOVA, Seongnam, South Korea
Source
2022 55TH ANNUAL IEEE/ACM INTERNATIONAL SYMPOSIUM ON MICROARCHITECTURE (MICRO) | 2022
Keywords
Natural Language Processing; GPT; Text Generation; Datacenter; Multi-FPGA Acceleration; Model Parallelism;
DOI
10.1109/MICRO56248.2022.00051
CLC Number (Chinese Library Classification)
TP3 [Computing technology and computer technology]
Subject Classification Code
0812
Abstract
Transformer is a deep learning language model widely used for natural language processing (NLP) services in datacenters. Among transformer models, the Generative Pre-trained Transformer (GPT) has achieved remarkable performance in text generation, or natural language generation (NLG), which requires processing a large input context in the summarization stage, followed by the generation stage that produces a single word at a time. Conventional platforms such as GPUs are specialized for the parallel processing of large inputs in the summarization stage, but their performance degrades significantly in the generation stage due to its sequential nature. Therefore, an efficient hardware platform is required to address the high latency caused by the sequential characteristic of text generation. In this paper, we present DFX, a multi-FPGA acceleration appliance that executes GPT-2 model inference end-to-end with low latency and high throughput in both the summarization and generation stages. DFX uses model parallelism and an optimized, model-and-hardware-aware dataflow for fast simultaneous workload execution across devices. Its compute cores operate on custom instructions and support GPT-2 operations end-to-end. We implement the proposed hardware architecture on four Xilinx Alveo U280 FPGAs and utilize all channels of the high-bandwidth memory (HBM) and the maximum number of compute resources for high hardware efficiency. DFX achieves a 5.58x speedup and 3.99x higher energy efficiency over four NVIDIA V100 GPUs on the modern GPT-2 model. DFX is also 8.21x more cost-effective than the GPU appliance, suggesting that it is a promising solution for text generation workloads in cloud datacenters.
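The distinction the abstract draws between the two stages of text generation can be made concrete with a short sketch. The Python snippet below is illustrative only: it runs GPT-2 through the Hugging Face transformers library on a general-purpose host, not on DFX, and the greedy decoding, 32-token budget, and "gpt2" checkpoint are arbitrary assumptions rather than the paper's configuration. It shows why the summarization (prefill) stage parallelizes over the whole input context while the generation stage must emit one token at a time.

    import torch
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

    # Hypothetical prompt; any input context works the same way.
    prompt = "FPGAs can accelerate transformer inference because"
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids

    with torch.no_grad():
        # Summarization (prefill) stage: every input token is processed in one
        # parallel pass, building the per-layer key/value cache.
        out = model(input_ids, use_cache=True)
        past = out.past_key_values
        next_id = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)

        # Generation stage: each new token depends on the previous one, so the
        # loop is inherently sequential -- the latency bottleneck DFX targets.
        generated = [next_id]
        for _ in range(32):
            out = model(next_id, past_key_values=past, use_cache=True)
            past = out.past_key_values
            next_id = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
            generated.append(next_id)

    print(tokenizer.decode(torch.cat([input_ids] + generated, dim=1)[0]))

In the prefill pass the matrix multiplications see the entire prompt at once and map well to massively parallel GPUs; inside the loop each step is a small, latency-bound workload, which is the regime the abstract argues DFX's multi-FPGA appliance is built for.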
Pages: 616-630
Page count: 15