TCB: Accelerating Transformer Inference Services with Request Concatenation

Cited by: 5
Authors
Fu, Boqian [1 ]
Chen, Fahao [1 ]
Li, Peng [1 ]
Zeng, Deze [2 ]
Affiliations
[1] Univ Aizu, Aizu Wakamatsu, Japan
[2] China Univ Geosci, Wuhan, Peoples R China
Keywords
Transformers; inference; scheduling; online algorithm;
DOI
10.1145/3545008.3545052
CLC Number
TP301 [Theory, Methods];
Subject Classification Number
081202 ;
Abstract
Transformer has dominated the field of natural language processing because of its strong capability to learn from sequential input data. In recent years, various computing and networking optimizations have been proposed to improve transformer training efficiency. However, transformer inference, the core of many AI services, has seldom been studied. A key challenge of transformer inference is variable-length input. To align these inputs, existing work has proposed batching schemes that pad with zeros, which unfortunately introduces significant computational redundancy. Moreover, existing transformer inference studies are separated from the serving system as a whole, in which both request batching and request scheduling are critical and interact in complex ways. To fill this research gap, we propose TCB, a transformer inference system with a novel ConcatBatching scheme and a jointly designed online scheduling algorithm. ConcatBatching minimizes computational redundancy by concatenating multiple requests, so that batch rows can be aligned with fewer padded zeros. Moreover, we conduct a systematic study by designing an online request scheduling algorithm that is aware of ConcatBatching. The scheduling algorithm needs no future request information and has a provable theoretical guarantee. Experimental results show that TCB significantly outperforms the state of the art.
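The padding redundancy the abstract describes can be made concrete with a small counting sketch. The snippet below is an illustrative assumption, not the paper's actual algorithm: it compares the number of computed cells under conventional per-row padding against a simple first-fit packing of requests into shared rows, which is the rough idea behind concatenating requests. Function names and the greedy heuristic are hypothetical.

```python
# Hypothetical sketch of the padding redundancy that ConcatBatching targets.
# The greedy first-fit packing below is an illustrative stand-in for the
# paper's concatenation scheme, not its actual algorithm.

def padded_cells(lengths, batch_size):
    """Cells computed when each request occupies one batch row,
    zero-padded to the longest request in its batch."""
    total = 0
    for i in range(0, len(lengths), batch_size):
        batch = lengths[i:i + batch_size]
        total += max(batch) * len(batch)
    return total

def concat_cells(lengths, row_len):
    """Cells computed when requests are greedily concatenated into
    fixed-length rows (first-fit decreasing), padding only each
    row's leftover space."""
    free = []  # remaining capacity of each open row
    for n in sorted(lengths, reverse=True):
        for j, cap in enumerate(free):
            if cap >= n:
                free[j] -= n
                break
        else:
            free.append(row_len - n)
    return len(free) * row_len

# Example mix of short and long requests (token counts are made up).
lengths = [120, 16, 96, 8, 64, 32, 24, 128]
print(padded_cells(lengths, batch_size=4))  # per-row padding
print(concat_cells(lengths, row_len=128))   # concatenated rows
```

With this request mix, concatenation computes roughly half the cells of per-row padding, because short requests share a row with long ones instead of each being padded to the batch maximum.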
Pages: 11