Effect of Visual Extensions on Natural Language Understanding in Vision-and-Language Models

Cited by: 0
Authors
Iki, Taichi [1,2]
Aizawa, Akiko [1,2]
Affiliations
[1] Natl Inst Informat, Chiyoda Ku, Tokyo, Japan
[2] Grad Univ Adv Studies, Hayama, Kanagawa, Japan
Keywords
DOI
None available
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104; 0812; 0835; 1405;
Abstract
A method for creating a vision-and-language (V&L) model is to extend a language model through structural modifications and V&L pre-training. Such an extension aims to make the V&L model inherit the natural language understanding (NLU) capability of the original language model. To see how well this is achieved, we propose to evaluate V&L models on an NLU benchmark (GLUE). We compare five V&L models, including single-stream and dual-stream models, trained with the same pre-training setup. Dual-stream models, with their higher modality independence achieved by approximately doubling the number of parameters, are expected to preserve the NLU capability better. Our main finding is that, contrary to expectation, the dual-stream scores do not differ much from the single-stream scores. Further analysis shows that pre-training causes a performance drop on NLU tasks, with few exceptions. These results suggest that adopting a single-stream structure and carefully designing the pre-training could be an effective way to better preserve language knowledge in V&L extensions.
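For readers who want to reproduce this kind of probe, the sketch below fine-tunes a text encoder on one GLUE task (SST-2) with the Hugging Face datasets and transformers libraries. This is a minimal illustration, not the authors' code: the bert-base-uncased checkpoint, the task choice, and all hyperparameters are assumptions; probing a V&L extension would swap in the language stream of that model instead.

import numpy as np
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# NOTE: illustrative sketch only; checkpoint, task, and hyperparameters are
# assumptions, not the configuration used in the paper.
CHECKPOINT = "bert-base-uncased"  # replace with the V&L model's language stream

dataset = load_dataset("glue", "sst2")       # one GLUE task: binary sentiment
tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)

def tokenize(batch):
    # SST-2 is single-sentence; pair tasks such as MNLI would pass two fields.
    return tokenizer(batch["sentence"], truncation=True, max_length=128)

encoded = dataset.map(tokenize, batched=True)

def accuracy(eval_pred):
    # GLUE uses task-specific metrics; plain accuracy is enough for SST-2.
    logits, labels = eval_pred
    return {"accuracy": float((np.argmax(logits, axis=-1) == labels).mean())}

model = AutoModelForSequenceClassification.from_pretrained(CHECKPOINT, num_labels=2)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="glue-sst2",
                           num_train_epochs=3,
                           per_device_train_batch_size=32),
    train_dataset=encoded["train"],
    eval_dataset=encoded["validation"],
    tokenizer=tokenizer,          # default collator then pads batches dynamically
    compute_metrics=accuracy,
)

trainer.train()
print(trainer.evaluate())         # validation accuracy on the chosen GLUE task

Rerunning the same script with different checkpoints lets the GLUE score gap between the original language model and its V&L extension serve as a direct measure of how much NLU capability the extension preserved.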
Pages: 2189-2196
Number of pages: 8
Related Papers
50 records in total
  • [1] Is BERT Blind? Exploring the Effect of Vision-and-Language Pretraining on Visual Language Understanding
    Alper, Morris
    Fiman, Michael
    Averbuch-Elor, Hadar
    [J]. 2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR, 2023, : 6778 - 6788
  • [2] NavGPT: Explicit Reasoning in Vision-and-Language Navigation with Large Language Models
    Zhou, Gengze
    Hong, Yicong
    Wu, Qi
    [J]. THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 7, 2024, : 7641 - 7649
  • [3] Kiki or Bouba? Sound Symbolism in Vision-and-Language Models
    Alper, Morris
    Averbuch-Elor, Hadar
    [J]. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023,
  • [4] Speaker-Follower Models for Vision-and-Language Navigation
    Fried, Daniel
    Hu, Ronghang
    Cirik, Volkan
    Rohrbach, Anna
    Andreas, Jacob
    Morency, Louis-Philippe
    Berg-Kirkpatrick, Taylor
    Saenko, Kate
    Klein, Dan
    Darrell, Trevor
    [J]. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 31 (NIPS 2018), 2018, 31
  • [5] Measuring Progress in Fine-grained Vision-and-Language Understanding
    Bugliarello, Emanuele
    Sartran, Laurent
    Agrawal, Aishwarya
    Hendricks, Lisa Anne
    Nematzadeh, Aida
    [J]. PROCEEDINGS OF THE 61ST ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL 2023, VOL 1, 2023, : 1559 - 1582
  • [6] Enhancing Scene Understanding for Vision-and-Language Navigation by Knowledge Awareness
    Gao, Fang
    Tang, Jingfeng
    Wang, Jiabao
    Li, Shaodong
    Yu, Jun
    [J]. IEEE ROBOTICS AND AUTOMATION LETTERS, 2024, 9 (12) : 10874 - 10881
  • [7] Iterative Vision-and-Language Navigation
    Krantz, Jacob
    Banerjee, Shurjo
    Zhu, Wang
    Corso, Jason
    Anderson, Peter
    Lee, Stefan
    Thomason, Jesse
    [J]. 2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023, : 14921 - 14930
  • [8] WinoGAViL: Gamified Association Benchmark to Challenge Vision-and-Language Models
    Bitton, Yonatan
    Bitton-Guetta, Nitzan
    Yosef, Ron
    Elovici, Yuval
    Bansal, Mohit
    Stanovsky, Gabriel
    Schwartz, Roy
    [J]. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35 (NEURIPS 2022), 2022,
  • [9] Tools Identification By On-Board Adaptation of Vision-and-Language Models
    Hu, Jun
    Miller, Phil
    Lomnitz, Michael
    Farkya, Saurabh
    Yilmaz, Emre
    Raghavan, Aswin
    Zhang, David
    Piacentino, Michael
    [J]. THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 21, 2024, : 23799 - 23801