Vary: Scaling up the Vision Vocabulary for Large Vision-Language Model

Cited: 0
Authors
Wei, Haoran [1 ]
Kong, Lingyu [2 ]
Chen, Jinyue [2 ]
Zhao, Liang [1 ]
Ge, Zheng [1 ]
Yang, Jinrong [3 ]
Sun, Jianjian [1 ]
Han, Chunrui [1 ]
Zhang, Xiangyu [1 ]
Affiliations
[1] MEGVII Technology, Beijing, China
[2] University of Chinese Academy of Sciences, Beijing, China
[3] Huazhong University of Science and Technology, Wuhan, China
Keywords
LVLM; Vision vocabulary; Fine-grained perception
DOI
10.1007/978-3-031-73235-5_23
Chinese Library Classification (CLC)
TP18 [Theory of Artificial Intelligence]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Most Large Vision-Language Models (LVLMs) share the same vision vocabulary, i.e., CLIP, for common vision tasks. However, for special tasks that need dense and fine-grained perception, a CLIP-style vocabulary may be inefficient at tokenizing the corresponding vision knowledge and may even suffer from out-of-vocabulary problems. Accordingly, we propose Vary, an efficient and productive method to scale up the vision vocabulary of LVLMs. Vary proceeds in two stages: generating a new vision vocabulary and integrating it. In the first stage, we devise a vocabulary network along with a tiny decoder-only transformer to compress rich vision signals. In the second stage, we scale up the vanilla vision vocabulary by merging the new vocabulary with the original (CLIP) one, enabling the LVLM to acquire new features effectively. We present the framework in two sizes, Vary-base (7B) and Vary-toy (1.8B), both of which achieve excellent fine-grained perception while maintaining strong general ability.
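To make the two-stage idea concrete, below is a minimal PyTorch-style sketch of how a newly generated vision vocabulary could be run in parallel with a CLIP-style one and merged before the language model. This is an illustration only, not the authors' implementation: the ToyVisionEncoder stand-in, the projector layout, the token counts and dimensions, and the choice to concatenate the two token streams along the sequence dimension are all assumptions made for readability.

```python
# Minimal sketch (not the paper's code) of merging two vision vocabularies.
# All module names, dimensions, and the fusion strategy are illustrative
# assumptions; the actual Vary implementation may differ.
import torch
import torch.nn as nn


class ToyVisionEncoder(nn.Module):
    """Stand-in for a patch-based vision encoder (e.g. a CLIP-style ViT or
    the new vocabulary network); maps an image to a sequence of vision tokens."""

    def __init__(self, out_dim: int, num_tokens: int = 256):
        super().__init__()
        self.num_tokens = num_tokens
        # A single conv acts as a toy patch-embedding layer.
        self.patch_embed = nn.Conv2d(3, out_dim, kernel_size=14, stride=14)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        x = self.patch_embed(images)          # (B, C, H', W')
        x = x.flatten(2).transpose(1, 2)      # (B, H'*W', C)
        return x[:, : self.num_tokens]        # (B, N, C)


class MergedVisionVocabulary(nn.Module):
    """Runs both vocabularies on the same image, projects each token stream
    into the LLM embedding space, and concatenates the streams so the LLM
    sees one extended vision vocabulary."""

    def __init__(self, llm_dim: int = 2048, clip_dim: int = 1024, new_dim: int = 1024):
        super().__init__()
        self.clip_vocab = ToyVisionEncoder(clip_dim)  # original (CLIP-style) vocabulary
        self.new_vocab = ToyVisionEncoder(new_dim)    # newly generated vocabulary
        self.clip_proj = nn.Linear(clip_dim, llm_dim)
        self.new_proj = nn.Linear(new_dim, llm_dim)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        clip_tokens = self.clip_proj(self.clip_vocab(images))
        new_tokens = self.new_proj(self.new_vocab(images))
        # Concatenate the two token streams: (B, N_clip + N_new, llm_dim).
        return torch.cat([new_tokens, clip_tokens], dim=1)


if __name__ == "__main__":
    vocab = MergedVisionVocabulary()
    dummy = torch.randn(1, 3, 224, 224)
    print(vocab(dummy).shape)  # torch.Size([1, 512, 2048])
```

In the paper's terms, the "new" branch would correspond to the vocabulary network trained with the tiny decoder-only transformer, while the CLIP branch preserves the original vocabulary; the exact fusion and training details in Vary may differ from this toy sketch.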
Pages: 408-424
Page count: 17