Improved Speaker and Navigator for Vision-and-Language Navigation

Cited by: 2
Authors
Wu, Zongkai [1]
Liu, Zihan [1]
Wang, Ting [1]
Wang, Donglin [2]
Affiliations
[1] Westlake Univ, Hangzhou 310024, Peoples R China
[2] Westlake Univ, Sch Engn, Machine Intelligence Lab MiLAB, Hangzhou 310024, Peoples R China
Keywords
Navigation; Visualization; Decoding; Trajectory; Task analysis; Feature extraction; Head;
DOI
10.1109/MMUL.2021.3058314
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Prior work in vision-and-language navigation (VLN) has relied on long short-term memory (LSTM) networks to carry the flow of information in either the navigation model (navigator) or the instruction-generating model (speaker). The capability of LSTM to process intermodal interactions has been widely verified; however, LSTM neglects intramodal interactions, which negatively affects both the navigator and the speaker. The attention-based Transformer performs well in sequence-to-sequence translation domains, but a Transformer structure applied directly to VLN has not yet yielded satisfactory results. In this article, we propose novel Transformer-based multimodal frameworks for the navigator and the speaker, respectively. In our frameworks, multihead self-attention with a residual connection is used to carry the information flow. Specifically, we set a switch to prevent them from directly entering the information flow in our navigator framework. In experiments, we verify the effectiveness of our proposed approach and show significant performance advantages over the baselines.
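The abstract names multihead self-attention with a residual connection as the mechanism that carries the information flow. As a rough illustration of that generic building block only, the following pure-Python sketch may help; it is not the authors' navigator or speaker, it omits their switch, and every name, dimension, and weight initialization in it (`matmul`, `attention_head`, `multihead_self_attention_residual`, `seq_len`, `d_model`, `num_heads`) is an assumption made for the example.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_head(x, wq, wk, wv):
    """Scaled dot-product self-attention for a single head."""
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    scale = math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]  # transpose of k
    scores = matmul(q, k_t)  # every position scores every other position
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, v)


def multihead_self_attention_residual(x, heads, wo):
    """Run all heads, concatenate, project with wo, then add the input back."""
    outputs = [attention_head(x, wq, wk, wv) for wq, wk, wv in heads]
    # Concatenate the per-head vectors position by position.
    concat = [[v for head in outputs for v in head[i]] for i in range(len(x))]
    projected = matmul(concat, wo)
    # Residual connection: the input is added to the attention output.
    return [[a + b for a, b in zip(x_row, p_row)] for x_row, p_row in zip(x, projected)]


def random_matrix(rows, cols, rng):
    """Small random weights; a hypothetical initialization for the demo."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


# Hypothetical sizes, chosen only to keep the demo small.
rng = random.Random(0)
seq_len, d_model, num_heads = 4, 8, 2
d_head = d_model // num_heads

x = random_matrix(seq_len, d_model, rng)
heads = [
    tuple(random_matrix(d_model, d_head, rng) for _ in range(3))
    for _ in range(num_heads)
]
wo = random_matrix(d_model, d_model, rng)

out = multihead_self_attention_residual(x, heads, wo)
print(len(out), len(out[0]))
```

Because of the residual connection, the output keeps the shape of the input, and if the output projection `wo` were all zeros the block would return the input unchanged.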
Pages: 55-63
Number of pages: 9
Related papers
  • [1] Speaker-Follower Models for Vision-and-Language Navigation
    Fried, Daniel
    Hu, Ronghang
    Cirik, Volkan
    Rohrbach, Anna
    Andreas, Jacob
    Morency, Louis-Philippe
    Berg-Kirkpatrick, Taylor
    Saenko, Kate
    Klein, Dan
    Darrell, Trevor
    [J]. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 31 (NIPS 2018), 2018, 31
  • [2] Multi-Grounding Navigator for Self-Supervised Vision-and-Language Navigation
    Wu, Zongkai
    Liu, Zihan
    Wang, Donglin
    [J]. 2021 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2021
  • [3] FOAM: A Follower-aware Speaker Model For Vision-and-Language Navigation
    Dou, Zi-Yi
    Peng, Nanyun
    [J]. NAACL 2022: THE 2022 CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: HUMAN LANGUAGE TECHNOLOGIES, 2022, : 4332 - 4340
  • [4] Iterative Vision-and-Language Navigation
    Krantz, Jacob
    Banerjee, Shurjo
    Zhu, Wang
    Corso, Jason
    Anderson, Peter
    Lee, Stefan
    Thomason, Jesse
    [J]. 2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023, : 14921 - 14930
  • [5] On the Evaluation of Vision-and-Language Navigation Instructions
    Zhao, Ming
    Anderson, Peter
    Jain, Vihan
    Wang, Su
    Ku, Alexander
    Baldridge, Jason
    Ie, Eugene
    [J]. 16TH CONFERENCE OF THE EUROPEAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (EACL 2021), 2021, : 1302 - 1316
  • [6] Recent Advances in Vision-and-language Navigation
    Sima S.-L.
    Huang Y.
    He K.-J.
    An D.
    Yuan H.
    Wang L.
    [J]. Zidonghua Xuebao/Acta Automatica Sinica, 2023, 49 (01): : 1 - 14
  • [7] Curriculum Learning for Vision-and-Language Navigation
    Zhang, Jiwen
    Wei, Zhongyu
    Fan, Jianqing
    Peng, Jiajie
    [J]. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 34 (NEURIPS 2021), 2021, 34
  • [8] Episodic Transformer for Vision-and-Language Navigation
    Pashevich, Alexander
    Schmid, Cordelia
    Sun, Chen
    [J]. 2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021), 2021, : 15922 - 15932
  • [9] WebVLN: Vision-and-Language Navigation on Websites
    Chen, Qi
    Pitawela, Dileepa
    Zhao, Chongyang
    Zhou, Gengze
    Chen, Hsiang-Ting
    Wu, Qi
    [J]. THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 2, 2024, : 1165 - 1173
  • [10] Local Slot Attention for Vision-and-Language Navigation
    Zhuang, Yifeng
    Sun, Qiang
    Fu, Yanwei
    Chen, Lifeng
    Xue, Xiangyang
    [J]. PROCEEDINGS OF THE 2022 INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL, ICMR 2022, 2022, : 545 - 553