High Efficiency Image Compression for Large Visual-Language Models

Cited by: 1
Authors
Li, Binzhe [1]
Wang, Shurun [2]
Wang, Shiqi [1]
Ye, Yan [3]
Affiliations
[1] City Univ Hong Kong, Dept Comp Sci, Hong Kong, Peoples R China
[2] Alibaba Grp, Beijing 311121, Peoples R China
[3] Alibaba Grp US, Sunnyvale, CA 94085 USA
Keywords
Image compression for machine; large visual-language model; pre-editing process; VIDEO;
DOI
10.1109/TCSVT.2024.3488181
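Chinese Library Classification
TM [Electrical engineering]; TN [Electronics and communication technology]
Discipline classification codes
0808; 0809
Abstract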
In recent years, large visual language models (LVLMs) have shown impressive performance and promising generalization capability in multi-modal tasks, thus replacing humans as receivers of visual information in various application scenarios. In this paper, we pioneer to propose a variable bitrate image compression scheme consisting of a pre-editing module and an end-to-end codec to achieve promising rate-accuracy performance for different LVLMs. In particular, instead of optimizing an adaptive pre-editing network towards a particular task or several representative tasks, we propose a new optimization strategy tailored for LVLMs, which is designed based on the representation and discrimination capability with token-level distortion and rank. The pre-editing module and the variable bitrate end-to-end image codec are jointly trained by the losses based on semantic tokens of the large model, which introduce enhanced generalization capability for various data and tasks. Experimental results demonstrate that the proposed framework could efficiently achieve much better rate-accuracy performance compared to the state-of-the-art coding standard, Versatile Video Coding. Meanwhile, experiments with multi-modal tasks have revealed the robustness and generalization capability of the proposed framework.
Pages: 2870-2880
Number of pages: 11