Measuring Progress in Fine-grained Vision-and-Language Understanding

Authors
Bugliarello, Emanuele [1, 2]
Sartran, Laurent [1]
Agrawal, Aishwarya [1]
Hendricks, Lisa Anne [1]
Nematzadeh, Aida [1]
Affiliations
[1] DeepMind, London, England
[2] Univ Copenhagen, Copenhagen, Denmark
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
While pretraining on large-scale image-text data from the Web has facilitated rapid progress on many vision-and-language (V&L) tasks, recent work has demonstrated that pretrained models lack "fine-grained" understanding, such as the ability to recognise relationships, verbs, and numbers in images. This has resulted in an increased interest in the community to either develop new benchmarks or models for such capabilities. To better understand and quantify progress in this direction, we investigate four competitive V&L models on four fine-grained benchmarks. Through our analysis, we find that X-VLM (Zeng et al., 2022) consistently outperforms other baselines, and that modelling innovations can impact performance more than scaling Web data, which even degrades performance sometimes. Through a deeper investigation of X-VLM, we highlight the importance of both novel losses and rich data sources for learning fine-grained skills. Finally, we inspect training dynamics, and discover that for some tasks, performance peaks early in training or significantly fluctuates, never converging.
Pages: 1559-1582
Page count: 24