VLUE: A Multi-Task Benchmark for Evaluating Vision-Language Models

Cited by: 0
|
Authors
Zhou, Wangchunshu [1 ]
Zeng, Yan [1 ]
Diao, Shizhe [2 ]
Zhang, Xinsong [1 ]
Affiliations
[1] ByteDance AI Lab, Beijing, People's Republic of China
[2] Hong Kong University of Science and Technology, Hong Kong, People's Republic of China
Keywords: (none listed)
DOI: not available
CLC Classification: TP18 [Theory of Artificial Intelligence]
Subject Classification Codes: 081104; 0812; 0835; 1405
Abstract
Recent advances in vision-language pre-training (VLP) have demonstrated impressive performance on a range of vision-language (VL) tasks. However, several challenges remain in measuring the community's progress toward building general multi-modal intelligence. First, most downstream VL datasets are annotated on raw images that were already seen during pre-training, which may lead to an overestimation of current VLP models' generalization ability. Second, recent VLP work focuses mainly on absolute performance and overlooks the efficiency-performance trade-off, which is also an important indicator of progress. To this end, we introduce the Vision-Language Understanding Evaluation (VLUE) benchmark, a multi-task, multi-dimension benchmark for evaluating the generalization capabilities and the efficiency-performance trade-off ("Pareto SOTA") of VLP models. We demonstrate that all VLP models exhibit a sizable generalization gap when tested on out-of-distribution test sets annotated on images drawn from a more diverse distribution spanning many cultures. Moreover, we find that measuring the efficiency-performance trade-off of VLP models yields complementary insights into several VLP design choices. We release the VLUE benchmark(1) to promote research on building vision-language models that generalize well to more diverse images and concepts unseen during pre-training, and that are practical in terms of the efficiency-performance trade-off.
Pages: 17
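
As an informal illustration of the "Pareto SOTA" criterion described in the abstract, the sketch below computes a Pareto frontier over (accuracy, latency) pairs: a model is Pareto-optimal if no other model is at least as accurate and at least as fast, with a strict improvement on at least one axis. The model names and numbers are hypothetical placeholders, not results from the paper.

from dataclasses import dataclass

@dataclass
class Model:
    name: str
    accuracy: float    # task performance, higher is better
    latency_ms: float  # inference cost, lower is better

def dominates(a: Model, b: Model) -> bool:
    """True if a is at least as good as b on both axes and strictly better on one."""
    return (a.accuracy >= b.accuracy and a.latency_ms <= b.latency_ms
            and (a.accuracy > b.accuracy or a.latency_ms < b.latency_ms))

def pareto_frontier(models: list[Model]) -> list[Model]:
    """Return the models not dominated by any other model ("Pareto SOTA")."""
    return [m for m in models
            if not any(dominates(o, m) for o in models if o is not m)]

if __name__ == "__main__":
    candidates = [                        # hypothetical VLP models
        Model("big-vlp",   82.1, 310.0),
        Model("base-vlp",  79.4, 120.0),
        Model("small-vlp", 74.8,  45.0),
        Model("slow-vlp",  78.0, 400.0),  # dominated by base-vlp
    ]
    for m in pareto_frontier(candidates):
        print(f"{m.name}: accuracy={m.accuracy}, latency={m.latency_ms} ms")

Here "slow-vlp" is excluded because "base-vlp" beats it on both axes; the remaining three models each occupy a different, non-dominated point on the efficiency-performance curve.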