Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models

Cited by: 0
Authors
Wei, Haoran [1]
Kong, Lingyu [2]
Chen, Jinyue [2]
Zhao, Liang [1]
Ge, Zheng [1]
Yang, Jinrong [3]
Sun, Jianjian [1]
Han, Chunrui [1]
Zhang, Xiangyu [1]
Affiliations
[1] MEGVII Technol, Beijing, Peoples R China
[2] Univ Chinese Acad Sci, Beijing, Peoples R China
[3] Huazhong Univ Sci & Technol, Wuhan, Peoples R China
Keywords
LVLM; Vision vocabulary; Fine-grained perception
DOI
10.1007/978-3-031-73235-5_23
CLC number
TP18 [Artificial intelligence theory]
Discipline codes
081104; 0812; 0835; 1405
Abstract
Most Large Vision-Language Models (LVLMs) share the same vision vocabulary, i.e., CLIP, which suffices for common vision tasks. However, for special tasks that require dense and fine-grained perception, a CLIP-style vocabulary may tokenize the corresponding vision knowledge inefficiently and can even suffer from out-of-vocabulary problems. Accordingly, we propose Vary, an efficient and effective method to scale up the vision vocabulary of LVLMs. Vary naturally divides into two stages: the generation and the integration of a new vision vocabulary. In the first stage, we devise a vocabulary network paired with a tiny decoder-only transformer to compress rich vision signals. In the second, we scale up the vanilla vision vocabulary by merging the new vocabulary with the original one (CLIP), enabling LVLMs to acquire the new features effectively. We present frameworks at two sizes, Vary-base (7B) and Vary-toy (1.8B), both of which deliver excellent fine-grained perception while maintaining strong general ability.
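To make the integration stage concrete, below is a minimal PyTorch sketch of how two vision vocabularies could feed one LLM: the frozen CLIP encoder and the newly generated encoder each produce vision tokens, each set is projected into the LLM embedding space, and the token sequences are concatenated. All module names, dimensions, and the concatenation scheme here are illustrative assumptions, not the paper's exact implementation.

    import torch
    import torch.nn as nn

    class MergedVisionVocabulary(nn.Module):
        # Integration-stage sketch: expose both the original (CLIP) and the
        # newly generated vision vocabulary to the LLM. The dimensions below
        # are assumptions (1024-wide vision tokens, a 4096-wide LLM).
        def __init__(self, clip_encoder: nn.Module, new_encoder: nn.Module,
                     clip_dim: int = 1024, new_dim: int = 1024,
                     llm_dim: int = 4096):
            super().__init__()
            self.clip_encoder = clip_encoder  # original vocabulary, kept frozen
            self.new_encoder = new_encoder    # vocabulary from the generation stage
            # Separate linear projections map each vocabulary into the LLM space.
            self.clip_proj = nn.Linear(clip_dim, llm_dim)
            self.new_proj = nn.Linear(new_dim, llm_dim)

        def forward(self, image: torch.Tensor) -> torch.Tensor:
            # Each encoder is assumed to return (batch, num_tokens, dim) tokens.
            clip_tokens = self.clip_proj(self.clip_encoder(image))
            new_tokens = self.new_proj(self.new_encoder(image))
            # Concatenate along the token axis so the LLM attends over both
            # vocabularies; Vary's actual merging scheme may differ.
            return torch.cat([clip_tokens, new_tokens], dim=1)

In such a setup, the merged token sequence would be prepended to the LLM's text embeddings, so the fine-grained features from the new vocabulary supplement, rather than replace, the CLIP features.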
Pages: 408-424
Page count: 17