Vision-and-Language or Vision-for-Language? On Cross-Modal Influence in Multimodal Transformers

Cited by: 0
Authors: Frank, Stella [1]; Bugliarello, Emanuele [2]; Elliott, Desmond [2]
Affiliations: [1] University of Trento, Trento, Italy; [2] University of Copenhagen, Copenhagen, Denmark
Funding: EU Horizon 2020
Keywords: none listed
DOI: not available
CLC classification: TP18 [theory of artificial intelligence]
Discipline codes: 081104; 0812; 0835; 1405
Abstract
Pretrained vision-and-language BERTs aim to learn representations that combine information from both modalities. We propose a diagnostic method based on cross-modal input ablation to assess the extent to which these models actually integrate cross-modal information. This method involves ablating inputs from one modality, either entirely or selectively based on cross-modal grounding alignments, and evaluating the model prediction performance on the other modality. Model performance is measured by modality-specific tasks that mirror the model pretraining objectives (e.g. masked language modelling for text). Models that have learned to construct cross-modal representations using both modalities are expected to perform worse when inputs are missing from a modality. We find that recently proposed models have much greater relative difficulty predicting text when visual information is ablated, compared to predicting visual object categories when text is ablated, indicating that these models are not symmetrically cross-modal.
Pages: 9847-9857 (11 pages)
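To make the diagnostic concrete, here is a minimal sketch of the text-side measurement: compute masked language modelling loss with the full visual input, then again with the visual input ablated, and report the gap. It assumes a hypothetical `model(token_ids, visual_feats)` with a ViLBERT-style interface returning per-token vocabulary logits; zeroing the region features stands in for one simple whole-modality ablation strategy, not the authors' exact implementation.

```python
# Sketch of cross-modal input ablation (text side). Assumes a hypothetical
# pretrained vision-and-language BERT exposed as model(token_ids, visual_feats)
# -> (batch, seq_len, vocab_size) logits; not the authors' released code.
import torch
import torch.nn.functional as F

MASK_ID = 103  # [MASK] token id in the standard BERT vocabulary


def masked_lm_loss(model, token_ids, masked_positions, visual_feats):
    """Mask the target tokens, run the model, and score its predictions."""
    inputs = token_ids.clone()
    inputs[:, masked_positions] = MASK_ID
    logits = model(inputs, visual_feats)       # (batch, seq, vocab)
    targets = token_ids[:, masked_positions]   # gold tokens at masked slots
    preds = logits[:, masked_positions, :]
    return F.cross_entropy(preds.reshape(-1, preds.size(-1)),
                           targets.reshape(-1))


def visual_ablation_gap(model, token_ids, masked_positions, visual_feats):
    """Increase in masked-LM loss when the visual input is removed.

    A large gap means the model relies on the image to predict the text;
    a gap near zero means its text predictions are effectively unimodal.
    """
    full = masked_lm_loss(model, token_ids, masked_positions, visual_feats)
    ablated = masked_lm_loss(model, token_ids, masked_positions,
                             torch.zeros_like(visual_feats))  # drop all regions
    return (ablated - full).item()
```

Running the analogous loop in the other direction (masking visual object categories and ablating the text) gives the vision-side score; the paper's headline finding is that the text-side gap is much larger than the vision-side one, i.e. the models are not symmetrically cross-modal.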