Enhancing Scene Understanding for Vision-and-Language Navigation by Knowledge Awareness

Authors
Gao, Fang [1, 2]
Tang, Jingfeng [1]
Wang, Jiabao [1]
Li, Shaodong [1]
Yu, Jun [3]
Affiliations
[1] Guangxi Univ, Sch Elect Engn, Nanning 530004, Peoples R China
[2] Anhui Key Lab Bion Sensing & Adv Robot Technol, Hefei 230031, Peoples R China
[3] Univ Sci & Technol China, Dept Automat, Hefei 230027, Peoples R China
Source
Keywords
Embodied AI; vision-and-language navigation; natural language generation; knowledge enhancement
DOI
10.1109/LRA.2024.3483042
Chinese Library Classification (CLC) number
TP24 [Robotics]
Discipline classification code
080202; 1405
Abstract
Vision-and-Language Navigation (VLN) has garnered widespread attention and research interest due to its potential applications in real-world scenarios. Despite significant progress in the VLN field in recent years, limitations persist. Many agents struggle to make accurate decisions when faced with similar candidate views during navigation, relying solely on the overall features of these views. This challenge primarily arises from the lack of common-sense knowledge about room layouts. Recognizing that room knowledge can establish relationships between rooms and objects in the environment, we construct room layout knowledge described in natural language by leveraging BLIP-2, including relationships between rooms and individual objects, relationships between objects, attributes of individual objects (such as color), and room types, thus providing comprehensive room layout information to the agent. We propose a Knowledge-Enhanced Scene Understanding (KESU) model to augment the agent's understanding of the environment by leveraging room layout knowledge. The Instruction Augmentation Module (IA) and the Knowledge History Fusion Module (KHF) in KESU respectively provide room layout knowledge for instructions and vision-history features, thereby enhancing the agent's navigation abilities. To more effectively integrate knowledge information with instruction features, we introduce Dynamic Residual Fusion (DRF) in the IA module. Finally, we conduct extensive experiments on the R2R, REVERIE, and SOON datasets, demonstrating the effectiveness of the proposed approach.
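The abstract names Dynamic Residual Fusion (DRF) but does not say how it is computed. The sketch below is only one plausible reading of the name, not the authors' method: a scalar gate, computed from the instruction and knowledge features, scales the knowledge vector before it is added residually to the instruction vector. The function name `gated_residual_fusion`, the parameters `gate_w` and `gate_b`, and the use of a single sigmoid gate are all assumptions made for illustration.

```python
import math
from typing import List, Sequence


def sigmoid(x: float) -> float:
    """Logistic function mapping a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_residual_fusion(
    instr: Sequence[float],
    know: Sequence[float],
    gate_w: Sequence[float],
    gate_b: float = 0.0,
) -> List[float]:
    """Hypothetical fusion: out = instr + g * know.

    g is a scalar gate computed from the concatenation [instr; know],
    so the amount of knowledge mixed in depends on the inputs.
    This is an illustrative assumption, not the DRF defined in the paper.
    """
    if len(instr) != len(know):
        raise ValueError("instr and know must have the same length")
    if len(gate_w) != 2 * len(instr):
        raise ValueError("gate_w must have length 2 * len(instr)")

    concat = list(instr) + list(know)
    gate = sigmoid(sum(w * x for w, x in zip(gate_w, concat)) + gate_b)

    # Residual connection: the instruction feature passes through unchanged,
    # and the gated knowledge feature is added on top of it.
    return [i + gate * k for i, k in zip(instr, know)]


if __name__ == "__main__":
    instruction = [1.0, 2.0]
    knowledge = [2.0, 4.0]
    # Zero gate weights and zero bias give a gate of exactly 0.5.
    print(gated_residual_fusion(instruction, knowledge, [0.0] * 4))
```

With a strongly negative `gate_b` the gate approaches zero and the output falls back to the instruction feature alone, which is the usual motivation for a residual form: the added knowledge can be ignored when it is not useful.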
Pages: 10874 - 10881
Number of pages: 8
Related papers
50 records in total
  • [1] ESceme: Vision-and-Language Navigation with Episodic Scene Memory
    Zheng, Qi
    Liu, Daqing
    Wang, Chaoyue
    Zhang, Jing
    Wang, Dadong
    Tao, Dacheng
    INTERNATIONAL JOURNAL OF COMPUTER VISION, 2025, 133 (01) : 254 - 274
  • [2] Enhancing Vision and Language Navigation With Prompt-Based Scene Knowledge
    Zhan, Zhaohuan
    Qin, Jinghui
    Zhuo, Wei
    Tan, Guang
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2024, 34 (10) : 9745 - 9756
  • [3] KERM: Knowledge Enhanced Reasoning for Vision-and-Language Navigation
    Li, Xiangyang
    Wang, Zihan
    Yang, Jiahao
    Wang, Yaowei
    Jiang, Shuqiang
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR, 2023, : 2583 - 2592
  • [4] Iterative Vision-and-Language Navigation
    Krantz, Jacob
    Banerjee, Shurjo
    Zhu, Wang
    Corso, Jason
    Anderson, Peter
    Lee, Stefan
    Thomason, Jesse
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023, : 14921 - 14930
  • [5] SOAT: A Scene- and Object-Aware Transformer for Vision-and-Language Navigation
    Moudgil, Abhinav
    Majumdar, Arjun
    Agrawal, Harsh
    Lee, Stefan
    Batra, Dhruv
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 34 (NEURIPS 2021), 2021,
  • [6] Recent Advances in Vision-and-language Navigation
    Sima S.-L.
    Huang Y.
    He K.-J.
    An D.
    Yuan H.
    Wang L.
    Zidonghua Xuebao/Acta Automatica Sinica, 2023, 49 (01) : 1 - 14
  • [7] Curriculum Learning for Vision-and-Language Navigation
    Zhang, Jiwen
    Wei, Zhongyu
    Fan, Jianqing
    Peng, Jiajie
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 34 (NEURIPS 2021), 2021, 34
  • [8] On the Evaluation of Vision-and-Language Navigation Instructions
    Zhao, Ming
    Anderson, Peter
    Jain, Vihan
    Wang, Su
    Ku, Alexander
    Baldridge, Jason
    Ie, Eugene
    16TH CONFERENCE OF THE EUROPEAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (EACL 2021), 2021, : 1302 - 1316
  • [9] Episodic Transformer for Vision-and-Language Navigation
    Pashevich, Alexander
    Schmid, Cordelia
    Sun, Chen
    2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021), 2021, : 15922 - 15932
  • [10] WebVLN: Vision-and-Language Navigation on Websites
    Chen, Qi
    Pitawela, Dileepa
    Zhao, Chongyang
    Zhou, Gengze
    Chen, Hsiang-Ting
    Wu, Qi
    THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 2, 2024, : 1165 - 1173