High Efficiency Image Compression for Large Visual-Language Models

Cited by: 1
Authors
Li, Binzhe [1]
Wang, Shurun [2]
Wang, Shiqi [1]
Ye, Yan [3]
Affiliations
[1] City Univ Hong Kong, Dept Comp Sci, Hong Kong, Peoples R China
[2] Alibaba Grp, Beijing 311121, Peoples R China
[3] Alibaba Grp US, Sunnyvale, CA 94085 USA
Keywords
Image compression for machine; large visual-language model; pre-editing process; VIDEO;
DOI
10.1109/TCSVT.2024.3488181
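Chinese Library Classification (CLC) number
TM [Electrical engineering]; TN [Electronics and communication technology]
Discipline classification code
0808; 0809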
Abstract
In recent years, large visual-language models (LVLMs) have shown impressive performance and promising generalization capability in multi-modal tasks, and are thus replacing humans as receivers of visual information in various application scenarios. In this paper, we pioneer a variable-bitrate image compression scheme, consisting of a pre-editing module and an end-to-end codec, to achieve promising rate-accuracy performance for different LVLMs. In particular, instead of optimizing an adaptive pre-editing network towards a particular task or several representative tasks, we propose a new optimization strategy tailored for LVLMs, designed around their representation and discrimination capability using token-level distortion and rank. The pre-editing module and the variable-bitrate end-to-end image codec are jointly trained with losses based on the semantic tokens of the large model, which improves generalization across data and tasks. Experimental results demonstrate that the proposed framework achieves much better rate-accuracy performance than the state-of-the-art coding standard, Versatile Video Coding. Experiments with multi-modal tasks further demonstrate the robustness and generalization capability of the proposed framework.
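The abstract describes training against losses computed on the large model's semantic tokens, balanced against bitrate. As a rough illustration of what a token-level rate-distortion objective can look like, here is a minimal sketch. It is not the authors' formulation: the abstract gives no formulas, the rank-based term is omitted because its definition is not stated, and the names `token_distortion`, `rate_accuracy_loss`, and `lam` are hypothetical.

```python
# Illustrative sketch only; the abstract does not specify the actual loss.
# All function and parameter names here are hypothetical.

def token_distortion(tokens_ref, tokens_rec):
    """Mean squared error between two equally shaped lists of token vectors."""
    if not tokens_ref or len(tokens_ref) != len(tokens_rec):
        raise ValueError("token sequences must be non-empty and equal in length")
    total, count = 0.0, 0
    for ref, rec in zip(tokens_ref, tokens_rec):
        if len(ref) != len(rec):
            raise ValueError("token vectors must have the same dimension")
        for a, b in zip(ref, rec):
            total += (a - b) ** 2
            count += 1
    return total / count


def rate_accuracy_loss(bits_per_pixel, tokens_ref, tokens_rec, lam=1.0):
    """Rate term plus a lam-weighted token-level distortion (assumed form)."""
    return bits_per_pixel + lam * token_distortion(tokens_ref, tokens_rec)
```

Here `tokens_ref` would stand for tokens the LVLM's visual encoder produces from the original image and `tokens_rec` for those from the decoded image; a larger `lam` trades more bitrate for closer tokens.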
Pages: 2870-2880
Page count: 11
Related papers
50 in total
  • [21] A Survey on Model Compression for Large Language Models
    Zhu, Xunyu
    Li, Jian
    Liu, Yong
    Ma, Can
    Wang, Weiping
    TRANSACTIONS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, 2024, 12 : 1556 - 1577
  • [22] Task-Oriented Grasp Prediction with Visual-Language Inputs
    Tang, Chao
    Huang, Dehao
    Meng, Lingxiao
    Liu, Weiyu
    Zhang, Hong
    2023 IEEE/RSJ INTERNATIONAL CONFERENCE ON INTELLIGENT ROBOTS AND SYSTEMS, IROS, 2023, : 4881 - 4888
  • [23] Large Language Models are Visual Reasoning Coordinators
    Chen, Liangyu
    Li, Bo
    Shen, Sheng
    Yang, Jingkang
    Li, Chunyuan
    Keutzer, Kurt
    Darrell, Trevor
    Liu, Ziwei
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023,
  • [24] Visual cognition in multimodal large language models
    Buschoff, Luca M. Schulze
    Akata, Elif
    Bethge, Matthias
    Schulz, Eric
    NATURE MACHINE INTELLIGENCE, 2025, 7 (01) : 96 - 106
  • [25] ToViLaG: Your Visual-Language Generative Model is Also An Evildoer
    Wang, Xinpeng
    Yi, Xiaoyuan
    Jiang, Han
    Zhou, Shanlin
    Wei, Zhihua
    Xie, Xing
    2023 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING, EMNLP 2023, 2023, : 3508 - 3533
  • [26] Most and Least Retrievable Images in Visual-Language Query Systems
    Zhu, Liuwan
    Ning, Rui
    Li, Jiang
    Xin, Chunsheng
    Wu, Hongyi
    COMPUTER VISION, ECCV 2022, PT XXXVII, 2022, 13697 : 1 - 18
  • [27] Knowledge-Enhanced Visual-Language Pretraining for Computational Pathology
    Zhou, Xiao
    Zhang, Xiaoman
    Wu, Chaoyi
    Zhang, Ya
    Xie, Weidi
    Wang, Yanfeng
    COMPUTER VISION - ECCV 2024, PT LII, 2025, 15110 : 345 - 362
  • [28] Exploring image-text combinations in visual humour through large language models (LLMs)
    Soriano-Gonzalez, Laura
    Belda-Medina, Jose
    DIGITAL SCHOLARSHIP IN THE HUMANITIES, 2024,
  • [29] Category-instance distillation based on visual-language models for rehearsal-free class incremental learning
    Jin, Weilong
    Wang, Zilei
    Zhang, Yixin
    IET COMPUTER VISION, 2024,
  • [30] Gloss-free Sign Language Translation: Improving from Visual-Language Pretraining
    Zhou, Benjia
    Chen, Zhigang
    Clapes, Albert
    Wan, Jun
    Liang, Yanyan
    Escalera, Sergio
    Lei, Zhen
    Zhang, Du
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023), 2023, : 20814 - 20824