Vary: Scaling up the Vision Vocabulary for Large Vision-Language Model

Cited by: 0
Authors
Wei, Haoran [1]
Kong, Lingyu [2]
Chen, Jinyue [2]
Zhao, Liang [1]
Ge, Zheng [1]
Yang, Jinrong [3]
Sun, Jianjian [1]
Han, Chunrui [1]
Zhang, Xiangyu [1]
Affiliations
[1] MEGVII Technol, Beijing, Peoples R China
[2] Univ Chinese Acad Sci, Beijing, Peoples R China
[3] Huazhong Univ Sci & Technol, Wuhan, Peoples R China
Source
Keywords
LVLM; Vision vocabulary; Fine-grained perception
DOI
10.1007/978-3-031-73235-5_23
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Most Large Vision-Language Models (LVLMs) share the same vision vocabulary, i.e., CLIP, for common vision tasks. However, for some special tasks that need dense and fine-grained perception, the CLIP-style vocabulary may be inefficient at tokenizing the corresponding vision knowledge and may even suffer from out-of-vocabulary problems. Accordingly, we propose Vary, an efficient and productive method to scale up the vision vocabulary of LVLMs. The procedure of Vary naturally divides into two phases: the generation and the integration of a new vision vocabulary. In the first phase, we devise a vocabulary network along with a tiny decoder-only transformer to compress rich vision signals. Next, we scale up the vanilla vision vocabulary by merging the new one with the original one (CLIP), enabling the LVLMs to garner new features effectively. We present frameworks in two sizes: Vary-base (7B) and Vary-toy (1.8B), both of which achieve excellent fine-grained perception performance while maintaining strong general ability.
Pages: 408-424
Page count: 17