VLUE: A Multi-Task Benchmark for Evaluating Vision-Language Models

Cited by: 0
Authors
Zhou, Wangchunshu [1]
Zeng, Yan [1]
Diao, Shizhe [2]
Zhang, Xinsong [1]
Affiliations
[1] ByteDance AI Lab, Beijing, People's Republic of China
[2] Hong Kong University of Science and Technology, Hong Kong, People's Republic of China
Keywords: (none listed)
DOI: not available
Chinese Library Classification: TP18 [Artificial Intelligence Theory]
Subject Classification Codes: 081104; 0812; 0835; 1405
Abstract
Recent advances in vision-language pre-training (VLP) have demonstrated impressive performance on a range of vision-language (VL) tasks. However, several challenges remain in measuring the community's progress toward building general multi-modal intelligence. First, most downstream VL datasets are annotated on raw images that have already been seen during pre-training, which may lead to an overestimation of current VLP models' generalization ability. Second, recent VLP work mainly focuses on absolute performance and overlooks the efficiency-performance trade-off, which is also an important indicator of progress. To this end, we introduce the Vision-Language Understanding Evaluation (VLUE) benchmark, a multi-task, multi-dimension benchmark for evaluating the generalization capabilities and the efficiency-performance trade-off ("Pareto SOTA") of VLP models. We demonstrate that there is a sizable generalization gap for all VLP models when testing on out-of-distribution test sets annotated on images from a more diverse distribution that spans cultures. Moreover, we find that measuring the efficiency-performance trade-off of VLP models yields complementary insights into several VLP design choices. We release the VLUE benchmark(1) to promote research on building vision-language models that generalize well to more diverse images and concepts unseen during pre-training, and that are practical in terms of the efficiency-performance trade-off.
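To make the abstract's two evaluation axes concrete, below is a minimal Python sketch of how a generalization gap (in-distribution vs. out-of-distribution score) and a "Pareto SOTA" frontier over the efficiency-performance plane could be computed. This is not the official VLUE evaluation code; the ModelResult fields, model names, and all numbers are illustrative assumptions.

```python
# Minimal sketch (not the official VLUE code) of the two quantities the
# abstract emphasizes: the generalization gap between in-distribution and
# out-of-distribution (OOD) test sets, and the Pareto frontier of the
# efficiency-performance trade-off. All names and numbers are placeholders.
from dataclasses import dataclass

@dataclass
class ModelResult:
    name: str
    in_dist_score: float   # score on the original (in-distribution) test set
    ood_score: float       # score on the OOD test set with more diverse images
    latency_ms: float      # inference cost as an efficiency proxy; lower is better

def generalization_gap(r: ModelResult) -> float:
    """Performance drop when moving from in-distribution to OOD data."""
    return r.in_dist_score - r.ood_score

def pareto_frontier(results: list[ModelResult]) -> list[ModelResult]:
    """Models not dominated by any other model in both OOD score and latency."""
    frontier = []
    for r in results:
        dominated = any(
            o.ood_score >= r.ood_score and o.latency_ms <= r.latency_ms
            and (o.ood_score > r.ood_score or o.latency_ms < r.latency_ms)
            for o in results
        )
        if not dominated:
            frontier.append(r)
    return sorted(frontier, key=lambda m: m.latency_ms)

if __name__ == "__main__":
    results = [  # hypothetical models with illustrative numbers only
        ModelResult("model_a", 78.2, 70.1, 45.0),
        ModelResult("model_b", 80.5, 71.3, 120.0),
        ModelResult("model_c", 76.0, 69.8, 150.0),
    ]
    for r in results:
        print(f"{r.name}: generalization gap = {generalization_gap(r):.1f}")
    print("Pareto frontier:", [m.name for m in pareto_frontier(results)])
```

In this toy example, model_c is dominated (model_b is both more accurate on OOD data and faster), so only model_a and model_b sit on the efficiency-performance frontier, which is the kind of comparison the "Pareto SOTA" notion refers to.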
Pages: 17