Compressing Context to Enhance Inference Efficiency of Large Language Models

Cited by: 0
Authors
Li, Yucheng [1 ]
Dong, Bo [1 ]
Guerin, Frank [1 ]
Lin, Chenghua [2 ,3 ]
Affiliations
[1] Univ Surrey, Dept Comp Sci, Guildford, Surrey, England
[2] Univ Manchester, Dept Comp Sci, Manchester, Lancs, England
[3] Univ Sheffield, Dept Comp Sci, Sheffield, S Yorkshire, England
Keywords
(none listed)
DOI
Not available
Chinese Library Classification (CLC)
TP18 [Theory of Artificial Intelligence]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
Large language models (LLMs) have achieved remarkable performance across a wide range of tasks. However, they struggle with long documents and extended conversations: computational requirements, in both memory and inference time, grow significantly with input length, and the context may be truncated when the input exceeds the LLM's fixed context window. This paper proposes Selective Context, a method that enhances the inference efficiency of LLMs by identifying and pruning redundancy in the input context to make the input more compact. We test our approach on common data sources that require long-context processing, namely arXiv papers, news articles, and long conversations, across summarisation, question answering, and response generation tasks. Experimental results show that Selective Context significantly reduces memory cost and generation latency while maintaining performance comparable to that achieved with the full context. Specifically, a 50% reduction in context cost yields a 36% reduction in inference memory usage and a 32% reduction in inference time, at the price of only a minor drop of 0.023 in BERTScore and 0.038 in faithfulness across four downstream applications, indicating that our method strikes a good balance between efficiency and performance. Code and data are available at https://github.com/liyucheng09/Selective_Context.
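The abstract does not spell out how redundancy in the context is measured. The sketch below assumes one plausible criterion, self-information scored by a small causal language model (here GPT-2 via Hugging Face transformers), and prunes the lowest-scoring tokens. It is an illustrative token-level simplification under those assumptions, not the authors' implementation (see the linked repository for that); the function name prune_context and the parameter keep_ratio are hypothetical.

```python
# Minimal sketch of context pruning via self-information, assuming a
# small causal LM as the scorer. Token-level only; the paper's method
# may operate on larger lexical units.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def prune_context(text: str, keep_ratio: float = 0.5) -> str:
    """Keep roughly `keep_ratio` of the tokens, dropping the least informative."""
    input_ids = tokenizer(text, return_tensors="pt")["input_ids"]
    with torch.no_grad():
        logits = model(input_ids).logits
    # Self-information of token t given its prefix: -log p(t | prefix).
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = input_ids[0, 1:]
    self_info = -log_probs[torch.arange(targets.size(0)), targets]
    # Select the k most informative tokens, restored to original order.
    k = max(1, int(keep_ratio * targets.size(0)))
    keep = torch.topk(self_info, k).indices.sort().values
    kept_ids = torch.cat([input_ids[0, :1], targets[keep]])
    return tokenizer.decode(kept_ids)

print(prune_context("The quick brown fox jumps over the lazy dog.", 0.5))
```

With keep_ratio=0.5, roughly half of the tokens survive, mirroring the 50% context-cost reduction reported in the abstract; highly predictable tokens (articles, common collocations) tend to carry the lowest self-information and are dropped first.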
Pages: 6342-6353
Page count: 12