Vision-and-Language or Vision-for-Language? On Cross-Modal Influence in Multimodal Transformers

Cited by: 0

Authors:

Frank, Stella [1]

Bugliarello, Emanuele [2]

Elliott, Desmond [2]

Affiliations:

[1] Univ Trento, Trento, Italy

[2] Univ Copenhagen, Copenhagen, Denmark

Source:

2021 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2021) | 2021

Funding:

European Union Horizon 2020

Keywords:

DOI:

Not available

Chinese Library Classification number:

TP18 [Artificial Intelligence Theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

Pretrained vision-and-language BERTs aim to learn representations that combine information from both modalities. We propose a diagnostic method based on cross-modal input ablation to assess the extent to which these models actually integrate cross-modal information. This method involves ablating inputs from one modality, either entirely or selectively based on cross-modal grounding alignments, and evaluating the model prediction performance on the other modality. Model performance is measured by modality-specific tasks that mirror the model pretraining objectives (e.g. masked language modelling for text). Models that have learned to construct cross-modal representations using both modalities are expected to perform worse when inputs are missing from a modality. We find that recently proposed models have much greater relative difficulty predicting text when visual information is ablated, compared to predicting visual object categories when text is ablated, indicating that these models are not symmetrically cross-modal.
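The ablation procedure described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names `ablate` and `relative_increase`, the mode names, the `None` and `"[MASK]"` fill values, and the toy tokens and regions are all hypothetical. The relative-increase formula is one plausible way to compare degradation across modalities; the abstract does not specify the exact metric.

```python
from typing import Collection, List, Optional, Sequence, TypeVar

T = TypeVar("T")


def ablate(
    items: Sequence[T],
    mode: str,
    aligned: Collection[int] = (),
    fill: Optional[T] = None,
) -> List[Optional[T]]:
    """Return a copy of one modality's inputs with some entries ablated.

    mode="none"    keeps every input (the unablated baseline),
    mode="all"     replaces every input with `fill`,
    mode="aligned" replaces only the inputs whose index is in `aligned`,
                   i.e. those grounded to the prediction target.
    """
    if mode == "none":
        return list(items)
    if mode == "all":
        return [fill] * len(items)
    if mode == "aligned":
        drop = set(aligned)
        return [fill if i in drop else x for i, x in enumerate(items)]
    raise ValueError(f"unknown ablation mode: {mode!r}")


def relative_increase(loss_full: float, loss_ablated: float) -> float:
    """Relative loss increase caused by ablation (an assumed metric)."""
    if loss_full <= 0:
        raise ValueError("loss_full must be positive")
    return (loss_ablated - loss_full) / loss_full


# Hypothetical toy example: a caption and three detected image regions.
tokens = ["a", "dog", "on", "a", "sofa"]
regions = ["region_dog", "region_sofa", "region_lamp"]

# Predicting the masked word "dog": ablate the visual side.
vision_none = ablate(regions, "none")
vision_aligned = ablate(regions, "aligned", aligned=[0])  # only the dog region
vision_all = ablate(regions, "all")

# Predicting the category of the dog region: ablate the text side.
text_aligned = ablate(tokens, "aligned", aligned=[1], fill="[MASK]")
```

In practice each ablated input would be fed to the pretrained model and the loss on the other modality's prediction task compared against the unablated baseline with `relative_increase`. A model that integrates both modalities should show a clear increase in both directions.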

Pages: 9847-9857

Page count: 11

