InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning

Cited: 0
Authors
Dai, Wenliang [1 ,2 ,4 ]
Li, Junnan [1 ]
Li, Dongxu [1 ]
Tiong, Anthony Meng Huat [1 ,3 ]
Zhao, Junqi [3 ]
Wang, Weisheng [3 ]
Li, Boyang [3 ]
Fung, Pascale [2 ]
Hoi, Steven [1 ]
Affiliations
[1] Salesforce Research, Beijing, People's Republic of China
[2] Hong Kong University of Science & Technology, Hong Kong, People's Republic of China
[3] Nanyang Technological University, Singapore, Singapore
[4] Salesforce, Beijing, People's Republic of China
Funding
National Research Foundation, Singapore
Keywords
DOI
None available
CLC Number
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Large-scale pre-training and instruction tuning have been successful at creating general-purpose language models with broad competence. However, building general-purpose vision-language models is challenging due to the rich input distributions and task diversity resulting from the additional visual input. Although vision-language pretraining has been widely studied, vision-language instruction tuning remains under-explored. In this paper, we conduct a systematic and comprehensive study on vision-language instruction tuning based on the pretrained BLIP-2 models. We gather 26 publicly available datasets, covering a wide variety of tasks and capabilities, and transform them into instruction tuning format. Additionally, we introduce an instruction-aware Query Transformer, which extracts informative features tailored to the given instruction. Trained on 13 held-in datasets, InstructBLIP attains state-of-the-art zero-shot performance across all 13 held-out datasets, substantially outperforming BLIP-2 and larger Flamingo models. Our models also lead to state-of-the-art performance when fine-tuned on individual downstream tasks (e.g., 90.7% accuracy on ScienceQA questions with image contexts). Furthermore, we qualitatively demonstrate the advantages of InstructBLIP over concurrent multimodal models. All InstructBLIP models are open-source.
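The abstract's key architectural idea is an instruction-aware Query Transformer: the instruction tokens are fed into the Q-Former alongside its learnable queries, so the visual features it extracts are conditioned on the task at hand. A minimal numpy sketch of that conditioning idea follows; the single cross-attention layer, function name, and all dimensions are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def instruction_aware_qformer(queries, instr_emb, image_feats, W_q, W_k, W_v):
    # Concatenate learnable query tokens with instruction token embeddings,
    # so feature extraction is conditioned on the instruction (the key idea).
    x = np.concatenate([queries, instr_emb], axis=0)        # (Q + T, d)
    # One cross-attention step over frozen image patch features.
    q, k, v = x @ W_q, image_feats @ W_k, image_feats @ W_v
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))          # (Q + T, P)
    out = attn @ v                                          # (Q + T, d)
    # Only the query positions are passed on to the frozen LLM.
    return out[: queries.shape[0]]

rng = np.random.default_rng(0)
d = 16
out = instruction_aware_qformer(
    rng.normal(size=(8, d)),    # 8 learnable query tokens
    rng.normal(size=(5, d)),    # 5 instruction tokens
    rng.normal(size=(32, d)),   # 32 frozen image patch features
    rng.normal(size=(d, d)), rng.normal(size=(d, d)), rng.normal(size=(d, d)),
)
print(out.shape)  # (8, 16)
```

Changing the instruction embeddings changes the attention pattern, and hence the extracted features, even for the same image, which is what distinguishes this from BLIP-2's instruction-agnostic Q-Former.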
Pages: 18