VLUE: A Multi-Task Benchmark for Evaluating Vision-Language Models

被引:0
|
作者
Zhou, Wangchunshu [1 ]
Zeng, Yan [1 ]
Diao, Shizhe [2 ]
Zhang, Xinsong [1 ]
机构
[1] ByteDance AI Lab, Beijing, Peoples R China
[2] Hong Kong Univ Sci & Technol, Hong Kong, Peoples R China
关键词
D O I
暂无
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
Recent advances in vision-language pre-training (VLP) have demonstrated impressive performance in a range of vision-language (VL) tasks. However, there exist several challenges for measuring the community's progress in building general multi-modal intelligence. First, most of the downstream VL datasets are annotated using raw images that are already seen during pre-training, which may result in an overestimation of current VLP models' generalization ability. Second, recent VLP work mainly focuses on absolute performance but overlooks the efficiency-performance trade-off, which is also an important indicator for measuring progress. To this end, we introduce the Vision-Language Understanding Evaluation (VLUE) benchmark, a multi-task multi-dimension benchmark for evaluating the generalization capabilities and the efficiency-performance trade-off ("Pareto SOTA") of VLP models. We demonstrate that there is a sizable generalization gap for all VLP models when testing on out-of-distribution test sets annotated on images from a more diverse distribution that spreads across cultures. Moreover, we find that measuring the efficiency-performance trade-off of VLP models leads to complementary insights for several design choices of VLP. We release the VLUE benchmark(1) to promote research on building vision-language models that generalize well to more diverse images and concepts unseen during pre-training, and are practical in terms of efficiency-performance trade-off.
引用
收藏
页数:17
相关论文
共 50 条
  • [31] Effectiveness assessment of recent large vision-language models
    Yao Jiang
    Xinyu Yan
    Ge-Peng Ji
    Keren Fu
    Meijun Sun
    Huan Xiong
    Deng-Ping Fan
    Fahad Shahbaz Khan
    [J]. Visual Intelligence, 2 (1):
  • [32] Towards an Exhaustive Evaluation of Vision-Language Foundation Models
    Salin, Emmanuelle
    Ayache, Stephane
    Favre, Benoit
    [J]. 2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOPS, ICCVW, 2023, : 339 - 352
  • [33] Multi-Resolution Sensing for Real-Time Control with Vision-Language Models
    Saxena, Saumya
    Sharma, Mohit
    Kroemer, Oliver
    [J]. CONFERENCE ON ROBOT LEARNING, VOL 229, 2023, 229
  • [34] ECO: Ensembling Context Optimization for Vision-Language Models
    Agnolucci, Lorenzo
    Baldrati, Alberto
    Todino, Francesco
    Becattini, Federico
    Bertini, Marco
    Del Bimbo, Alberto
    [J]. 2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOPS, ICCVW, 2023, : 2803 - 2807
  • [35] DeAR: Debiasing Vision-Language Models with Additive Residuals
    Seth, Ashish
    Hemani, Mayur
    Agarwal, Chirag
    [J]. 2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR, 2023, : 6820 - 6829
  • [36] Learning Domain Invariant Prompt for Vision-Language Models
    Zhao, Cairong
    Wang, Yubin
    Jiang, Xinyang
    Shen, Yifei
    Song, Kaitao
    Li, Dongsheng
    Miao, Duoqian
    [J]. IEEE TRANSACTIONS ON IMAGE PROCESSING, 2024, 33 : 1348 - 1360
  • [37] Adapting vision-language AI models to cardiology tasks
    Arnaout, Rima
    [J]. NATURE MEDICINE, 2024,
  • [38] Language Modelling as a Multi-Task Problem
    Weber, Lucas
    Jumelet, Jaap
    Bruni, Elia
    Hupkes, Dieuwke
    [J]. 16TH CONFERENCE OF THE EUROPEAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (EACL 2021), 2021, : 2049 - 2060
  • [39] Improving Vision Transformer with Multi-Task Training
    Ahn, Woo Jin
    Yang, Geun Yeong
    Choi, Hyun Duck
    Lim, Myo Taeg
    Kang, Tae Koo
    [J]. 2022 22ND INTERNATIONAL CONFERENCE ON CONTROL, AUTOMATION AND SYSTEMS (ICCAS 2022), 2022, : 1963 - 1965
  • [40] Making Small Language Models Better Multi-task Learners with Mixture-of-Task-Adapters
    Xie, Yukang
    Wang, Chengyu
    Yan, Junbing
    Zhou, Jiyong
    Deng, Feiqi
    Huang, Jun
    [J]. PROCEEDINGS OF THE 17TH ACM INTERNATIONAL CONFERENCE ON WEB SEARCH AND DATA MINING, WSDM 2024, 2024, : 1094 - 1097