Are You Looking? Grounding to Multiple Modalities in Vision-and-Language Navigation

Cited by: 0
Authors
Hu, Ronghang [1 ]
Fried, Daniel [1 ]
Rohrbach, Anna [1 ]
Klein, Dan [1 ]
Darrell, Trevor [1 ]
Saenko, Kate [2 ]
Affiliations
[1] Univ Calif Berkeley, Berkeley, CA 94720 USA
[2] Boston Univ, Boston, MA 02215 USA
Keywords
DOI
Not available
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Vision-and-Language Navigation (VLN) requires grounding instructions, such as turn right and stop at the door, to routes in a visual environment. The actual grounding can connect language to the environment through multiple modalities, e.g. stop at the door might ground into visual objects, while turn right might rely only on the geometric structure of a route. We investigate where the natural language empirically grounds under two recent state-of-the-art VLN models. Surprisingly, we discover that visual features may actually hurt these models: models which only use route structure, ablating visual features, outperform their visual counterparts in unseen new environments on the benchmark Room-to-Room dataset. To better use all the available modalities, we propose to decompose the grounding procedure into a set of expert models with access to different modalities (including object detections) and ensemble them at prediction time, improving the performance of state-of-the-art models on the VLN task.
Pages: 6551 - 6557
Page count: 7
Related papers
50 items in total
  • [1] Iterative Vision-and-Language Navigation
    Krantz, Jacob
    Banerjee, Shurjo
    Zhu, Wang
    Corso, Jason
    Anderson, Peter
    Lee, Stefan
    Thomason, Jesse
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023, : 14921 - 14930
  • [2] Multi-Grounding Navigator for Self-Supervised Vision-and-Language Navigation
    Wu, Zongkai
    Liu, Zihan
    Wang, Donglin
    2021 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2021,
  • [3] Federated Learning for Vision-and-Language Grounding Problems
    Liu, Fenglin
    Wu, Xian
    Ge, Shen
    Fan, Wei
    Zou, Yuexian
    THIRTY-FOURTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, THE THIRTY-SECOND INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE AND THE TENTH AAAI SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2020, 34 : 11572 - 11579
  • [4] On the Evaluation of Vision-and-Language Navigation Instructions
    Zhao, Ming
    Anderson, Peter
    Jain, Vihan
    Wang, Su
    Ku, Alexander
    Baldridge, Jason
    Ie, Eugene
    16TH CONFERENCE OF THE EUROPEAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (EACL 2021), 2021, : 1302 - 1316
  • [5] Room-Across-Room: Multilingual Vision-and-Language Navigation with Dense Spatiotemporal Grounding
    Ku, Alexander
    Anderson, Peter
    Patel, Roma
    Ie, Eugene
    Baldridge, Jason
    PROCEEDINGS OF THE 2020 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP), 2020, : 4392 - 4412
  • [6] Recent Advances in Vision-and-language Navigation
    Sima, S.-L.
    Huang, Y.
    He, K.-J.
    An, D.
    Yuan, H.
    Wang, L.
    Zidonghua Xuebao/Acta Automatica Sinica, 2023, 49 (01): : 1 - 14
  • [7] Curriculum Learning for Vision-and-Language Navigation
    Zhang, Jiwen
    Wei, Zhongyu
    Fan, Jianqing
    Peng, Jiajie
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 34 (NEURIPS 2021), 2021, 34
  • [8] Episodic Transformer for Vision-and-Language Navigation
    Pashevich, Alexander
    Schmid, Cordelia
    Sun, Chen
    2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021), 2021, : 15922 - 15932
  • [9] WebVLN: Vision-and-Language Navigation on Websites
    Chen, Qi
    Pitawela, Dileepa
    Zhao, Chongyang
    Zhou, Gengze
    Chen, Hsiang-Ting
    Wu, Qi
    THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 2, 2024, : 1165 - 1173
  • [10] Does Vision-and-Language Pretraining Improve Lexical Grounding?
    Yun, Tian
    Sun, Chen
    Pavlick, Ellie
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, EMNLP 2021, 2021, : 4357 - 4366