Enhancing Scene Understanding for Vision-and-Language Navigation by Knowledge Awareness

Cited by: 0

Authors:

Gao, Fang [1,2]

Tang, Jingfeng [1]

Wang, Jiabao [1]

Li, Shaodong [1]

Yu, Jun [3]

Affiliations:

[1] Guangxi Univ, Sch Elect Engn, Nanning 530004, Peoples R China

[2] Anhui Key Lab Bion Sensing & Adv Robot Technol, Hefei 230031, Peoples R China

[3] Univ Sci & Technol China, Dept Automat, Hefei 230027, Peoples R China

Source:

IEEE ROBOTICS AND AUTOMATION LETTERS | 2024 / Vol. 9 / No. 12

Keywords:

Embodied AI; vision-and-language navigation; natural language generation; knowledge enhancement;

DOI:

10.1109/LRA.2024.3483042

Chinese Library Classification (CLC) number:

TP24 [Robot technology];

Discipline classification code:

080202 ; 1405 ;

Abstract:

Vision-and-Language Navigation (VLN) has garnered widespread attention and research interest due to its potential applications in real-world scenarios. Despite significant progress in the VLN field in recent years, limitations persist. Many agents struggle to make accurate decisions when faced with similar candidate views during navigation, relying solely on the overall features of these views. This challenge primarily arises from the lack of common-sense knowledge about room layouts. Recognizing that room knowledge can establish relationships between rooms and objects in the environment, we construct room layout knowledge described in natural language by leveraging BLIP-2, including relationships between rooms and individual objects, relationships between objects, attributes of individual objects (such as color), and room types, thus providing comprehensive room layout information to the agent. We propose a Knowledge-Enhanced Scene Understanding (KESU) model to augment the agent's understanding of the environment by leveraging room layout knowledge. The Instruction Augmentation Module (IA) and the Knowledge History Fusion Module (KHF) in KESU respectively provide room layout knowledge for instructions and vision-history features, thereby enhancing the agent's navigation abilities. To more effectively integrate knowledge information with instruction features, we introduce Dynamic Residual Fusion (DRF) in the IA module. Finally, we conduct extensive experiments on the R2R, REVERIE, and SOON datasets, demonstrating the effectiveness of the proposed approach.


Pages: 10874-10881

Page count: 8

