Vision-and-Language Navigation Based on Cross-Modal Feature Fusion in Indoor Environment

Cited by: 4

Authors:

Wen, Shuhuan [1]

Lv, Xiaohan [1,2]

Yu, F. Richard [3]

Gong, Simeng [1]

Affiliations:

[1] Yanshan Univ, Engn Res Ctr, Minist Educ Intelligent Control Syst & Intelligent, Qinhuangdao 066004, Peoples R China

[2] Yanshan Univ, Key Lab Ind Comp Control Engn Hebei Prov, Qinhuangdao 066004, Peoples R China

[3] Carleton Univ, Dept Syst & Comp Engn, Ottawa, ON K1S 5B6, Canada

Source:

IEEE TRANSACTIONS ON COGNITIVE AND DEVELOPMENTAL SYSTEMS | 2023 / Vol. 15 / No. 01

Funding:

National Natural Science Foundation of China;

Keywords:

Attention; data augmentation; deep reinforcement learning (RL); vision-and-language navigation (VLN);

DOI:

10.1109/TCDS.2021.3139543

Chinese Library Classification number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

It is challenging for an agent to simultaneously decipher visual and language information and make decisions to perform corresponding actions. Recently, the vision-and-language navigation (VLN) task has been proposed, in which an agent navigates a real 3-D indoor environment based on a language instruction and the visual information available at its current viewpoint. The key to this task is that the agent must understand both modalities, vision and language, in an unknown environment in order to navigate effectively. In this study, we capture the alignment relationship between visual features and language features using a cross-modal feature fusion method. Attention is used to build the cross-modal fusion module so that visual features contain language information and language features contain visual information, thereby allowing the model to learn more feature relationships and improving the success rate (SR) of agent navigation. Considering the practical significance of agent navigation, we aim to shorten the agent's trajectory as much as possible while ensuring that the agent still reaches the target position. We employ a reinforcement learning algorithm based on the advantage actor-critic to constrain the agent's action selection and thus shorten the trajectory length. To further improve the performance of the model and reduce the gap between the agent's performance in known and unknown environments, we propose the data augmentation method Cro-Speaker and, based on it, three training methods: Speaker data augmentation (SD), Cro-Speaker data augmentation (CSD), and Speaker and Cro-Speaker data augmentation (SCSD). We evaluate the proposed method on the Room-to-Room data set. The results show that the proposed method improves the SR of agent navigation, shortens the navigation trajectory, and generalizes well across known and unknown environments.
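
As a rough illustration of the bidirectional fusion idea described in the abstract (visual features attending to language features and vice versa), the following is a minimal sketch only. The abstract does not specify the module's layers, so everything here is an assumption: plain scaled dot-product attention without learned projections, a residual sum as the fusion step, and toy two-dimensional features. The names `cross_attend`, `visual`, and `language` are hypothetical and do not come from the paper.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(queries, keys_values):
    """Enrich each query vector with a softmax-weighted mix of the other modality.

    Assumption: no learned projections; the other modality's vectors serve
    as both keys and values.
    """
    dim = len(queries[0])
    fused = []
    for q in queries:
        # Scaled dot-product score of this query against every key.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
                  for k in keys_values]
        weights = softmax(scores)
        # Context vector: convex combination of the other modality's vectors.
        context = [sum(w * kv[i] for w, kv in zip(weights, keys_values))
                   for i in range(dim)]
        # Assumed fusion step: residual sum of original features and context.
        fused.append([a + c for a, c in zip(q, context)])
    return fused

# Toy features (hypothetical values): 3 view features, 2 word features.
visual = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
language = [[0.5, 0.5], [1.0, 0.0]]

visual_fused = cross_attend(visual, language)    # vision enriched with language
language_fused = cross_attend(language, visual)  # language enriched with vision
```

Each output keeps the shape of its own modality while mixing in information from the other, which is the property the abstract attributes to the fusion module; a trained model would additionally use learned query, key, and value projections.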


Pages: 3-15

Number of pages: 13

