Effect of Visual Extensions on Natural Language Understanding in Vision-and-Language Models

Cited by: 0
Authors
Iki, Taichi [1,2]
Aizawa, Akiko [1,2]
Affiliations
[1] Natl Inst Informat, Chiyoda Ku, Tokyo, Japan
[2] Grad Univ Adv Studies, Hayama, Kanagawa, Japan
Keywords
DOI
None available
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104; 0812; 0835; 1405;
Abstract
A method for creating a vision-and-language (V&L) model is to extend a language model through structural modifications and V&L pre-training. Such an extension aims to make the V&L model inherit the natural language understanding (NLU) capability of the original language model. To see how well this is achieved, we propose to evaluate V&L models on an NLU benchmark (GLUE). We compare five V&L models, including single-stream and dual-stream models, trained with the same pre-training setup. Dual-stream models, with their higher modality independence achieved by approximately doubling the number of parameters, are expected to preserve the NLU capability better. Our main finding is that, contrary to expectation, the dual-stream scores do not differ much from the single-stream scores. Further analysis shows that pre-training causes a performance drop on NLU tasks, with few exceptions. These results suggest that adopting a single-stream structure and carefully designing the pre-training could be an effective way to better preserve language knowledge in V&L extensions.
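For readers who want to reproduce this kind of probe, the sketch below fine-tunes a text encoder on one GLUE task (SST-2) with the Hugging Face datasets and transformers libraries. This is a minimal illustration, not the authors' code: the bert-base-uncased checkpoint, the task choice, and all hyperparameters are assumptions; probing a V&L extension would swap in the language stream of that model instead.

import numpy as np
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# NOTE: illustrative sketch only; checkpoint, task, and hyperparameters are
# assumptions, not the configuration used in the paper.
CHECKPOINT = "bert-base-uncased"  # replace with the V&L model's language stream

dataset = load_dataset("glue", "sst2")       # one GLUE task: binary sentiment
tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)

def tokenize(batch):
    # SST-2 is single-sentence; pair tasks such as MNLI would pass two fields.
    return tokenizer(batch["sentence"], truncation=True, max_length=128)

encoded = dataset.map(tokenize, batched=True)

def accuracy(eval_pred):
    # GLUE uses task-specific metrics; plain accuracy is enough for SST-2.
    logits, labels = eval_pred
    return {"accuracy": float((np.argmax(logits, axis=-1) == labels).mean())}

model = AutoModelForSequenceClassification.from_pretrained(CHECKPOINT, num_labels=2)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="glue-sst2",
                           num_train_epochs=3,
                           per_device_train_batch_size=32),
    train_dataset=encoded["train"],
    eval_dataset=encoded["validation"],
    tokenizer=tokenizer,          # default collator then pads batches dynamically
    compute_metrics=accuracy,
)

trainer.train()
print(trainer.evaluate())         # validation accuracy on the chosen GLUE task

Rerunning the same script with different checkpoints lets the GLUE score gap between the original language model and its V&L extension serve as a direct measure of how much NLU capability the extension preserved.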
Pages: 2189-2196
Number of pages: 8
Related Papers
50 records in total
  • [1] Is BERT Blind? Exploring the Effect of Vision-and-Language Pretraining on Visual Language Understanding
    Alper, Morris
    Fiman, Michael
    Averbuch-Elor, Hadar
    [J]. 2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR, 2023, : 6778 - 6788
  • [2] NavGPT: Explicit Reasoning in Vision-and-Language Navigation with Large Language Models
    Zhou, Gengze
    Hong, Yicong
    Wu, Qi
    [J]. THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 7, 2024, : 7641 - 7649
  • [3] Kiki or Bouba? Sound Symbolism in Vision-and-Language Models
    Alper, Morris
    Averbuch-Elor, Hadar
    [J]. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023,
  • [4] Speaker-Follower Models for Vision-and-Language Navigation
    Fried, Daniel
    Hu, Ronghang
    Cirik, Volkan
    Rohrbach, Anna
    Andreas, Jacob
    Morency, Louis-Philippe
    Berg-Kirkpatrick, Taylor
    Saenko, Kate
    Klein, Dan
    Darrell, Trevor
    [J]. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 31 (NIPS 2018), 2018, 31
  • [5] Measuring Progress in Fine-grained Vision-and-Language Understanding
    Bugliarello, Emanuele
    Sartran, Laurent
    Agrawal, Aishwarya
    Hendricks, Lisa Anne
    Nematzadeh, Aida
    [J]. PROCEEDINGS OF THE 61ST ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL 2023, VOL 1, 2023, : 1559 - 1582
  • [6] Enhancing Scene Understanding for Vision-and-Language Navigation by Knowledge Awareness
    Gao, Fang
    Tang, Jingfeng
    Wang, Jiabao
    Li, Shaodong
    Yu, Jun
    [J]. IEEE ROBOTICS AND AUTOMATION LETTERS, 2024, 9 (12) : 10874 - 10881
  • [7] Iterative Vision-and-Language Navigation
    Krantz, Jacob
    Banerjee, Shurjo
    Zhu, Wang
    Corso, Jason
    Anderson, Peter
    Lee, Stefan
    Thomason, Jesse
    [J]. 2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023, : 14921 - 14930
  • [8] WinoGAViL: Gamified Association Benchmark to Challenge Vision-and-Language Models
    Bitton, Yonatan
    Bitton-Guetta, Nitzan
    Yosef, Ron
    Elovici, Yuval
    Bansal, Mohit
    Stanovsky, Gabriel
    Schwartz, Roy
    [J]. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35 (NEURIPS 2022), 2022,
  • [9] Tools Identification By On-Board Adaptation of Vision-and-Language Models
    Hu, Jun
    Miller, Phil
    Lomnitz, Michael
    Farkya, Saurabh
    Yilmaz, Emre
    Raghavan, Aswin
    Zhang, David
    Piacentino, Michael
    [J]. THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 21, 2024, : 23799 - 23801