High Efficiency Image Compression for Large Visual-Language Models

Cited by: 1

Authors:

Li, Binzhe [1]

Wang, Shurun [2]

Wang, Shiqi [1]

Ye, Yan [3]

Affiliations:

[1] City Univ Hong Kong, Dept Comp Sci, Hong Kong, Peoples R China

[2] Alibaba Grp, Beijing 311121, Peoples R China

[3] Alibaba Grp US, Sunnyvale, CA 94085 USA

Source:

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY | 2025 / Vol. 35 / No. 03

Keywords:

Image compression for machine; large visual-language model; pre-editing process; VIDEO;

DOI:

10.1109/TCSVT.2024.3488181

Chinese Library Classification (CLC) number:

TM [Electrical engineering]; TN [Electronic technology, communication technology];

Discipline classification code:

0808 ; 0809 ;

Abstract:

In recent years, large visual-language models (LVLMs) have shown impressive performance and promising generalization capability in multi-modal tasks, and are thus replacing humans as the receivers of visual information in various application scenarios. In this paper, we pioneer a variable-bitrate image compression scheme, consisting of a pre-editing module and an end-to-end codec, to achieve promising rate-accuracy performance for different LVLMs. In particular, instead of optimizing an adaptive pre-editing network towards a particular task or several representative tasks, we propose a new optimization strategy tailored for LVLMs, which is designed based on representation and discrimination capability, with token-level distortion and rank. The pre-editing module and the variable-bitrate end-to-end image codec are jointly trained with losses based on the semantic tokens of the large model, which enhances generalization capability across various data and tasks. Experimental results demonstrate that the proposed framework efficiently achieves much better rate-accuracy performance than the state-of-the-art coding standard, Versatile Video Coding. Experiments with multi-modal tasks further show the robustness and generalization capability of the proposed framework.


Pages: 2870-2880

Page count: 11

