Vary: Scaling up the Vision Vocabulary for Large Vision-Language Model

Cited by: 0
Authors
Wei, Haoran [1]
Kong, Lingyu [2]
Chen, Jinyue [2]
Zhao, Liang [1]
Ge, Zheng [1]
Yang, Jinrong [3]
Sun, Jianjian [1]
Han, Chunrui [1]
Zhang, Xiangyu [1]
Affiliations
[1] MEGVII Technology, Beijing, People's Republic of China
[2] University of Chinese Academy of Sciences, Beijing, People's Republic of China
[3] Huazhong University of Science and Technology, Wuhan, People's Republic of China
Source
Keywords
LVLM; Vision vocabulary; Fine-grained perception
DOI
10.1007/978-3-031-73235-5_23
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Most Large Vision-Language Models (LVLMs) share the same vision vocabulary, i.e., CLIP, for common vision tasks. However, for some special tasks that need dense and fine-grained perception, the CLIP-style vocabulary may be inefficient at tokenizing the corresponding vision knowledge and may even suffer from out-of-vocabulary problems. Accordingly, we propose Vary, an efficient and productive method to scale up the vision vocabulary of LVLMs. The procedure of Vary naturally divides into two phases: the generation and the integration of a new vision vocabulary. In the first phase, we devise a vocabulary network along with a tiny decoder-only transformer to compress rich vision signals. In the second, we scale up the vanilla vision vocabulary by merging the new one with the original one (CLIP), enabling the LVLMs to garner new features effectively. We present frameworks in two sizes, Vary-base (7B) and Vary-toy (1.8B), both of which deliver excellent fine-grained perception performance while maintaining strong general ability.
Pages: 408-424
Page count: 17