TCB: Accelerating Transformer Inference Services with Request Concatenation

Cited by: 5
Authors
Fu, Boqian [1 ]
Chen, Fahao [1 ]
Li, Peng [1 ]
Zeng, Deze [2 ]
Affiliations
[1] Univ Aizu, Aizu Wakamatsu, Japan
[2] China Univ Geosci, Wuhan, Peoples R China
Keywords
Transformers; inference; scheduling; online algorithm;
DOI
10.1145/3545008.3545052
CLC Number
TP301 [Theory, Methods];
Subject Classification Number
081202 ;
Abstract
Transformer has dominated the field of natural language processing because of its strong capability to learn from sequential input data. In recent years, various computing and networking optimizations have been proposed to improve transformer training efficiency. However, transformer inference, the core of many AI services, has seldom been studied. A key challenge of transformer inference is variable-length input. To align these inputs, existing work has proposed batching schemes that pad with zeros, which unfortunately introduces significant computational redundancy. Moreover, existing transformer inference studies are separated from the serving system as a whole, in which both request batching and request scheduling are critical and interact in complex ways. To fill this research gap, we propose TCB, a transformer inference system with a novel ConcatBatching scheme and a jointly designed online scheduling algorithm. ConcatBatching minimizes computational redundancy by concatenating multiple requests, so that batch rows can be aligned with fewer padded zeros. Moreover, we conduct a systematic study by designing an online request scheduling algorithm that is aware of ConcatBatching. The scheduling algorithm needs no future request information and has a provable theoretical guarantee. Experimental results show that TCB significantly outperforms the state of the art.
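The padding redundancy the abstract describes can be made concrete with a small counting sketch. The snippet below is an illustrative assumption, not the paper's actual algorithm: it compares the number of computed cells under conventional per-row padding against a simple first-fit packing of requests into shared rows, which is the rough idea behind concatenating requests. Function names and the greedy heuristic are hypothetical.

```python
# Hypothetical sketch of the padding redundancy that ConcatBatching targets.
# The greedy first-fit packing below is an illustrative stand-in for the
# paper's concatenation scheme, not its actual algorithm.

def padded_cells(lengths, batch_size):
    """Cells computed when each request occupies one batch row,
    zero-padded to the longest request in its batch."""
    total = 0
    for i in range(0, len(lengths), batch_size):
        batch = lengths[i:i + batch_size]
        total += max(batch) * len(batch)
    return total

def concat_cells(lengths, row_len):
    """Cells computed when requests are greedily concatenated into
    fixed-length rows (first-fit decreasing), padding only each
    row's leftover space."""
    free = []  # remaining capacity of each open row
    for n in sorted(lengths, reverse=True):
        for j, cap in enumerate(free):
            if cap >= n:
                free[j] -= n
                break
        else:
            free.append(row_len - n)
    return len(free) * row_len

# Example mix of short and long requests (token counts are made up).
lengths = [120, 16, 96, 8, 64, 32, 24, 128]
print(padded_cells(lengths, batch_size=4))  # per-row padding
print(concat_cells(lengths, row_len=128))   # concatenated rows
```

With this request mix, concatenation computes roughly half the cells of per-row padding, because short requests share a row with long ones instead of each being padded to the batch maximum.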
Pages: 11