DFX: A Low-latency Multi-FPGA Appliance for Accelerating Transformer-based Text Generation

Cited by: 22
Authors
Hong, Seongmin [1]
Moon, Seungjae [1]
Kim, Junsoo [1]
Lee, Sungjae [2]
Kim, Minsub [2]
Lee, Dongsoo [2]
Kim, Joo-Young [1]
Affiliations
[1] Korea Advanced Institute of Science and Technology (KAIST), Daejeon, South Korea
[2] NAVER CLOVA, Seongnam, South Korea
Keywords
Natural Language Processing; GPT; Text Generation; Datacenter; Multi-FPGA Acceleration; Model Parallelism
DOI
10.1109/MICRO56248.2022.00051
Chinese Library Classification (CLC)
TP3 [Computing Technology, Computer Technology]
Discipline Classification Code
0812
Abstract
The transformer is a deep learning language model widely used for natural language processing (NLP) services in datacenters. Among transformer models, the Generative Pre-trained Transformer (GPT) has achieved remarkable performance in text generation, or natural language generation (NLG), which processes a large input context in the summarization stage and then produces a single word at a time in the generation stage. Conventional platforms such as GPUs are specialized for the parallel processing of large inputs in the summarization stage, but their performance degrades significantly in the generation stage because of its sequential nature. An efficient hardware platform is therefore required to address the high latency caused by the sequential characteristic of text generation. In this paper, we present DFX, a multi-FPGA acceleration appliance that executes GPT-2 model inference end-to-end with low latency and high throughput in both the summarization and generation stages. DFX uses model parallelism and an optimized, model-and-hardware-aware dataflow for fast simultaneous workload execution across devices. Its compute cores operate on custom instructions and support GPT-2 operations end-to-end. We implement the proposed hardware architecture on four Xilinx Alveo U280 FPGAs, utilizing all channels of the high-bandwidth memory (HBM) and the maximum number of compute resources for high hardware efficiency. On the modern GPT-2 model, DFX achieves a 5.58x speedup and 3.99x higher energy efficiency than four NVIDIA V100 GPUs. DFX is also 8.21x more cost-effective than the GPU appliance, suggesting that it is a promising solution for text-generation workloads in cloud datacenters.
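The two-stage workload the abstract describes maps directly onto the prefill/decode split of autoregressive transformer inference. The sketch below illustrates it with the Hugging Face transformers GPT-2 implementation, chosen here purely for illustration; DFX itself executes these stages on custom FPGA compute cores, not this software stack. The summarization stage consumes the entire context in one parallel pass, while the generation stage emits one token per step, each step depending on the previous token.

```python
# A minimal sketch of the summarization (prefill) and generation (decode)
# stages of GPT text generation, using the Hugging Face `transformers`
# GPT-2 implementation. Illustration of the workload only; DFX runs these
# stages on custom FPGA compute cores, not this API.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

context = "Transformer models dominate natural language processing because"
input_ids = tokenizer(context, return_tensors="pt").input_ids

with torch.no_grad():
    # Summarization stage: the whole input context is processed in one
    # large, highly parallel pass -- the regime GPUs are optimized for.
    out = model(input_ids, use_cache=True)
    past = out.past_key_values
    next_id = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)

    # Generation stage: tokens are produced strictly one at a time, and
    # each step depends on the token produced by the previous step. This
    # sequential, latency-bound loop is the bottleneck DFX targets.
    generated = [next_id]
    for _ in range(31):
        out = model(next_id, past_key_values=past, use_cache=True)
        past = out.past_key_values
        next_id = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        generated.append(next_id)

print(context + tokenizer.decode(torch.cat(generated, dim=1)[0]))
```

Because each decode step is a small matrix-vector workload that cannot be parallelized across time, per-step latency rather than peak throughput governs end-to-end response time, which is why the paper argues conventional GPU platforms are a poor fit for the generation stage.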
Pages: 616-630
Page count: 15