Vision-and-Language Navigation Based on Cross-Modal Feature Fusion in Indoor Environment

Cited by: 4
Authors
Wen, Shuhuan [1]
Lv, Xiaohan [1,2]
Yu, F. Richard [3]
Gong, Simeng [1]
Affiliations
[1] Yanshan Univ, Engn Res Ctr, Minist Educ Intelligent Control Syst & Intelligent, Qinhuangdao 066004, Peoples R China
[2] Yanshan Univ, Key Lab Ind Comp Control Engn Hebei Prov, Qinhuangdao 066004, Peoples R China
[3] Carleton Univ, Dept Syst & Comp Engn, Ottawa, ON K1S 5B6, Canada
Funding
National Natural Science Foundation of China
Keywords
Attention; data augmentation; deep reinforcement learning (RL); vision-and-language navigation (VLN)
DOI
10.1109/TCDS.2021.3139543
CLC number
TP18 [Artificial Intelligence Theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
It is challenging for an agent to simultaneously interpret visual and language information and decide on the corresponding actions. The vision-and-language navigation (VLN) task was recently proposed to let an agent navigate a real 3-D indoor environment based on a language instruction and the visual information observable from its current viewpoint. The key to this task is that the agent must understand the two modalities, vision and language, in an unknown environment in order to navigate effectively. In this study, we capture the alignment between visual features and language features using a cross-modal feature fusion method. Attention is used to build the cross-modal fusion module so that visual features incorporate language information and language features incorporate visual information, allowing the model to learn more feature relationships and improving the success rate (SR) of agent navigation. Because navigation efficiency matters in practice, we aim to shorten the agent's trajectory as much as possible while still ensuring that it reaches the target position. To this end, we employ a reinforcement learning algorithm based on the advantage actor-critic to constrain the agent's action selection and thereby shorten the trajectory length. To further improve the model and reduce the performance gap between known and unknown environments, we propose the Cro-Speaker data augmentation method and three training schemes built on it: Speaker data augmentation (SD), Cro-Speaker data augmentation (CSD), and combined Speaker and Cro-Speaker data augmentation (SCSD). We evaluate the proposed method on the Room-to-Room data set. The results show that it improves the navigation SR, shortens the navigation trajectory length, and generalizes well across known and unknown environments.
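As an illustration of the cross-modal fusion idea described in the abstract, the following is a minimal sketch of attention-based feature fusion, assuming PyTorch; the class name, feature dimensions, and residual/normalization layout are illustrative assumptions, not the paper's exact architecture.

```python
# A minimal sketch of attention-based cross-modal feature fusion (assumed
# layout, not the authors' exact module): vision attends to language so that
# visual features carry language information, and language attends to vision
# so that language features carry visual information.
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.vis_to_lang = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.lang_to_vis = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_v = nn.LayerNorm(dim)
        self.norm_l = nn.LayerNorm(dim)

    def forward(self, vis_feats, lang_feats):
        # Each visual view queries the instruction tokens (query = vision,
        # key/value = language), absorbing the relevant language context.
        v_att, _ = self.vis_to_lang(vis_feats, lang_feats, lang_feats)
        # Each instruction token queries the visual views in the same way.
        l_att, _ = self.lang_to_vis(lang_feats, vis_feats, vis_feats)
        # Residual connections preserve the original unimodal features.
        return self.norm_v(vis_feats + v_att), self.norm_l(lang_feats + l_att)

# Hypothetical shapes: batch of 1, 36 panoramic views, 80 instruction tokens.
fusion = CrossModalFusion()
v, l = fusion(torch.randn(1, 36, 512), torch.randn(1, 80, 512))
```

Likewise, here is a generic sketch of the advantage actor-critic objective that the abstract says is used to constrain action selection; the function name, loss coefficients, and tensor shapes are illustrative assumptions rather than the paper's training setup.

```python
# A generic advantage actor-critic (A2C) loss: the policy is pushed toward
# actions that outperformed the critic's estimate, the critic is regressed
# toward observed returns, and an entropy bonus encourages exploration.
import torch

def a2c_loss(log_probs, values, returns, entropies,
             value_coef=0.5, entropy_coef=0.01):
    # Advantage: how much better the sampled actions were than the critic's
    # value estimate; detached so the MSE term trains only the critic.
    advantages = returns - values
    policy_loss = -(log_probs * advantages.detach()).mean()
    value_loss = advantages.pow(2).mean()
    return policy_loss + value_coef * value_loss - entropy_coef * entropies.mean()
```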
Pages: 3-15 (13 pages)
Related papers (50 records)
  • [1] Wen, Shuhuan; Gong, Simeng; Zhang, Ziyuan; Yu, F. Richard; Wang, Zhiwen. Vision-and-language navigation based on history-aware cross-modal feature fusion in indoor environment. Knowledge-Based Systems, 2024, 305.
  • [2] Irshad, Muhammad Zubair; Ma, Chih-Yao; Kira, Zsolt. Hierarchical Cross-Modal Agent for Robotics Vision-and-Language Navigation. 2021 IEEE International Conference on Robotics and Automation (ICRA 2021), 2021: 13238-13246.
  • [3] Ni, Han; Chen, Jia; Zhu, DaYong; Shi, Dianxi. A Cross-Modal Object-Aware Transformer for Vision-and-Language Navigation. 2022 IEEE 34th International Conference on Tools with Artificial Intelligence (ICTAI), 2022: 976-981.
  • [4] Wu, Siying; Fu, Xueyang; Wu, Feng; Zha, Zheng-Jun. Cross-modal Semantic Alignment Pre-training for Vision-and-Language Navigation. Proceedings of the 30th ACM International Conference on Multimedia (MM 2022), 2022: 4233-4241.
  • [5] Frank, Stella; Bugliarello, Emanuele; Elliott, Desmond. Vision-and-Language or Vision-for-Language? On Cross-Modal Influence in Multimodal Transformers. 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP 2021), 2021: 9847-9857.
  • [6] Ramshetty, Shivaen; Verma, Gaurav; Kumar, Srijan. Cross-Modal Attribute Insertions for Assessing the Robustness of Vision-and-Language Learning. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL 2023): Long Papers, Vol. 1, 2023: 15974-15990.
  • [7] Georgakis, Georgios; Schmeckpeper, Karl; Wanchoo, Karan; Dan, Soham; Miltsakaki, Eleni; Roth, Dan; Daniilidis, Kostas. Cross-modal Map Learning for Vision and Language Navigation. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), 2022: 15439-15449.
  • [8] Li, Jialu; Tan, Hao; Bansal, Mohit. ENVEDIT: Environment Editing for Vision-and-Language Navigation. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), 2022: 15386-15396.
  • [9] Zhang, Yubo; Tan, Hao; Bansal, Mohit. Diagnosing the Environment Bias in Vision-and-Language Navigation. Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, 2020: 890-897.
  • [10] Li, Jialu; Tan, Hao; Bansal, Mohit. Improving Cross-Modal Alignment in Vision Language Navigation via Syntactic Information. 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2021), 2021: 1041-1050.