Vary: Scaling up the Vision Vocabulary for Large Vision-Language Model

Citations: 0
Authors
Wei, Haoran [1 ]
Kong, Lingyu [2 ]
Chen, Jinyue [2 ]
Zhao, Liang [1 ]
Ge, Zheng [1 ]
Yang, Jinrong [3 ]
Sun, Jianjian [1 ]
Han, Chunrui [1 ]
Zhang, Xiangyu [1 ]
Affiliations
[1] MEGVII Technol, Beijing, Peoples R China
[2] Univ Chinese Acad Sci, Beijing, Peoples R China
[3] Huazhong Univ Sci & Technol, Wuhan, Peoples R China
Keywords
LVLM; Vision vocabulary; Fine-grained perception
DOI
10.1007/978-3-031-73235-5_23
CLC number
TP18 [Artificial Intelligence Theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Most Large Vision-Language Models (LVLMs) share the same vision vocabulary, i.e., CLIP, for common vision tasks. However, for certain specialized tasks that require dense and fine-grained perception, a CLIP-style vocabulary may tokenize the corresponding vision knowledge inefficiently and can even suffer from out-of-vocabulary problems. Accordingly, we propose Vary, an efficient and effective method to scale up the vision vocabulary of LVLMs. The procedure of Vary naturally divides into two stages: the generation and the integration of a new vision vocabulary. In the first stage, we devise a vocabulary network together with a tiny decoder-only transformer to compress rich vision signals. In the second stage, we scale up the vanilla vision vocabulary by merging the new vocabulary with the original (CLIP) one, enabling LVLMs to acquire new features effectively. We present frameworks of two sizes, Vary-base (7B) and Vary-toy (1.8B), both of which deliver excellent fine-grained perception while maintaining strong general ability.
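The integration stage described above can be sketched as fusing the token features of the two vision vocabularies before they enter the LLM. The following is a minimal illustrative sketch, not the paper's actual implementation: the tensor shapes (256 tokens, 1024 channels per branch), the channel-wise concatenation, and the single linear projection are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed, illustrative shapes: both the original (CLIP) vocabulary and
# the new vocabulary network emit 256 visual tokens of 1024 channels.
num_tokens, dim = 256, 1024
clip_tokens = rng.standard_normal((num_tokens, dim))  # original vocabulary
new_tokens = rng.standard_normal((num_tokens, dim))   # new vocabulary

# Merge the two vocabularies by concatenating token features along the
# channel axis, then project into the LLM's embedding space with a
# single (hypothetical) linear map.
llm_dim = 2048
merged = np.concatenate([clip_tokens, new_tokens], axis=-1)  # (256, 2048)
proj = rng.standard_normal((merged.shape[-1], llm_dim)) * 0.01
llm_inputs = merged @ proj  # (256, 2048): one embedding per visual token

print(merged.shape, llm_inputs.shape)
```

The sketch only conveys the shape bookkeeping of merging two vocabularies; in the real system the new vocabulary network is trained first (stage one) and the projection is learned jointly with the LVLM.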
Pages: 408-424
Page count: 17