Compressing Context to Enhance Inference Efficiency of Large Language Models

Cited by: 0
Authors
Li, Yucheng [1 ]
Dong, Bo [1 ]
Guerin, Frank [1 ]
Lin, Chenghua [2 ,3 ]
Affiliations
[1] Univ Surrey, Dept Comp Sci, Guildford, Surrey, England
[2] Univ Manchester, Dept Comp Sci, Manchester, Lancs, England
[3] Univ Sheffield, Dept Comp Sci, Sheffield, S Yorkshire, England
Keywords
(none listed)
DOI
Not available
Chinese Library Classification (CLC)
TP18 [Theory of Artificial Intelligence]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
Large language models (LLMs) have achieved remarkable performance across a wide range of tasks. However, they struggle with long documents and extended conversations: computational requirements, in both memory and inference time, grow significantly with input length, and the context may be truncated when the input exceeds the LLM's fixed context window. This paper proposes Selective Context, a method that enhances the inference efficiency of LLMs by identifying and pruning redundancy in the input context to make the input more compact. We test our approach on common data sources that require long-context processing, namely arXiv papers, news articles, and long conversations, across summarisation, question answering, and response generation tasks. Experimental results show that Selective Context significantly reduces memory cost and generation latency while maintaining performance comparable to that achieved with the full context. Specifically, a 50% reduction in context cost yields a 36% reduction in inference memory usage and a 32% reduction in inference time, at the price of only a minor drop of 0.023 in BERTScore and 0.038 in faithfulness across four downstream applications, indicating that our method strikes a good balance between efficiency and performance. Code and data are available at https://github.com/liyucheng09/Selective_Context.
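The abstract does not spell out how redundancy in the context is measured. The sketch below assumes one plausible criterion, self-information scored by a small causal language model (here GPT-2 via Hugging Face transformers), and prunes the lowest-scoring tokens. It is an illustrative token-level simplification under those assumptions, not the authors' implementation (see the linked repository for that); the function name prune_context and the parameter keep_ratio are hypothetical.

```python
# Minimal sketch of context pruning via self-information, assuming a
# small causal LM as the scorer. Token-level only; the paper's method
# may operate on larger lexical units.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def prune_context(text: str, keep_ratio: float = 0.5) -> str:
    """Keep roughly `keep_ratio` of the tokens, dropping the least informative."""
    input_ids = tokenizer(text, return_tensors="pt")["input_ids"]
    with torch.no_grad():
        logits = model(input_ids).logits
    # Self-information of token t given its prefix: -log p(t | prefix).
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = input_ids[0, 1:]
    self_info = -log_probs[torch.arange(targets.size(0)), targets]
    # Select the k most informative tokens, restored to original order.
    k = max(1, int(keep_ratio * targets.size(0)))
    keep = torch.topk(self_info, k).indices.sort().values
    kept_ids = torch.cat([input_ids[0, :1], targets[keep]])
    return tokenizer.decode(kept_ids)

print(prune_context("The quick brown fox jumps over the lazy dog.", 0.5))
```

With keep_ratio=0.5, roughly half of the tokens survive, mirroring the 50% context-cost reduction reported in the abstract; highly predictable tokens (articles, common collocations) tend to carry the lowest self-information and are dropped first.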
Pages: 6342-6353
Page count: 12