High Efficiency Image Compression for Large Visual-Language Models

Cited by: 1
Authors
Li, Binzhe [1]
Wang, Shurun [2]
Wang, Shiqi [1]
Ye, Yan [3]
Affiliations
[1] City Univ Hong Kong, Dept Comp Sci, Hong Kong, Peoples R China
[2] Alibaba Grp, Beijing 311121, Peoples R China
[3] Alibaba Grp US, Sunnyvale, CA 94085 USA
Keywords
Image compression for machine; large visual-language model; pre-editing process; VIDEO;
DOI
10.1109/TCSVT.2024.3488181
Chinese Library Classification
TM [Electrical engineering]; TN [Electronic technology, communication technology];
Discipline classification codes
0808 ; 0809 ;
Abstract
In recent years, large visual-language models (LVLMs) have shown impressive performance and promising generalization capability in multi-modal tasks, and are thus replacing humans as receivers of visual information in various application scenarios. In this paper, we pioneer a variable-bitrate image compression scheme, consisting of a pre-editing module and an end-to-end codec, that achieves promising rate-accuracy performance for different LVLMs. In particular, instead of optimizing an adaptive pre-editing network towards a particular task or several representative tasks, we propose a new optimization strategy tailored for LVLMs, designed around their representation and discrimination capability using token-level distortion and rank. The pre-editing module and the variable-bitrate end-to-end image codec are jointly trained with losses based on the semantic tokens of the large model, which improves generalization across various data and tasks. Experimental results demonstrate that the proposed framework efficiently achieves much better rate-accuracy performance than the state-of-the-art coding standard, Versatile Video Coding (VVC). In addition, experiments on multi-modal tasks demonstrate the robustness and generalization capability of the proposed framework.
Pages: 2870-2880
Page count: 11