Enhancing Scene Understanding for Vision-and-Language Navigation by Knowledge Awareness

Cited by: 0
Authors
Gao, Fang [1 ,2 ]
Tang, Jingfeng [1 ]
Wang, Jiabao [1 ]
Li, Shaodong [1 ]
Yu, Jun [3 ]
Affiliations
[1] Guangxi Univ, Sch Elect Engn, Nanning 530004, Peoples R China
[2] Anhui Key Lab Bion Sensing & Adv Robot Technol, Hefei 230031, Peoples R China
[3] Univ Sci & Technol China, Dept Automat, Hefei 230027, Peoples R China
Source
IEEE Robotics and Automation Letters
Keywords
Embodied AI; vision-and-language navigation; natural language generation; knowledge enhancement
DOI
10.1109/LRA.2024.3483042
Chinese Library Classification (CLC)
TP24 [Robotics]
Discipline codes
080202; 1405
Abstract
Vision-and-Language Navigation (VLN) has attracted widespread attention and research interest because of its potential applications in real-world scenarios. Despite significant progress in recent years, limitations persist: many agents struggle to make accurate decisions when faced with similar candidate views during navigation because they rely solely on the overall features of these views. This challenge arises primarily from a lack of common-sense knowledge about room layouts. Recognizing that room knowledge can establish relationships between rooms and objects in the environment, we leverage BLIP-2 to construct room layout knowledge described in natural language, covering relationships between rooms and individual objects, relationships between objects, attributes of individual objects (such as color), and room types, thereby providing comprehensive room layout information to the agent. We propose a Knowledge-Enhanced Scene Understanding (KESU) model that augments the agent's understanding of the environment with this room layout knowledge. The Instruction Augmentation (IA) module and the Knowledge History Fusion (KHF) module in KESU supply room layout knowledge to the instruction features and the vision-history features, respectively, thereby enhancing the agent's navigation ability. To integrate the knowledge information with the instruction features more effectively, we introduce Dynamic Residual Fusion (DRF) in the IA module. Finally, extensive experiments on the R2R, REVERIE, and SOON datasets demonstrate the effectiveness of the proposed approach.
Pages: 10874 - 10881 (8 pages)
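
The Dynamic Residual Fusion (DRF) step described in the abstract above, which injects room-layout knowledge into the instruction features through a learned residual gate, can be illustrated with a minimal PyTorch sketch. This is an assumption-based illustration rather than the authors' implementation: the class name DynamicResidualFusion, the feature dimension, the per-token sigmoid gate, and the layer normalization are all assumed for demonstration only.

# Minimal sketch (assumptions noted above), not the paper's released code.
import torch
import torch.nn as nn


class DynamicResidualFusion(nn.Module):
    def __init__(self, dim: int = 768):
        super().__init__()
        # Project knowledge features into the instruction feature space.
        self.knowledge_proj = nn.Linear(dim, dim)
        # Predict a per-token gate from the concatenated features,
        # so the amount of injected knowledge varies dynamically.
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.norm = nn.LayerNorm(dim)

    def forward(self, instr_feats: torch.Tensor, know_feats: torch.Tensor) -> torch.Tensor:
        # instr_feats, know_feats: (batch, seq_len, dim), assumed aligned in length.
        know = self.knowledge_proj(know_feats)
        g = self.gate(torch.cat([instr_feats, know], dim=-1))
        # Residual fusion: keep the original instruction features and
        # add gated knowledge on top of them.
        return self.norm(instr_feats + g * know)


if __name__ == "__main__":
    drf = DynamicResidualFusion(dim=768)
    instr = torch.randn(2, 40, 768)   # encoded instruction tokens
    know = torch.randn(2, 40, 768)    # encoded room-layout knowledge tokens
    print(drf(instr, know).shape)     # torch.Size([2, 40, 768])

The residual form keeps the original instruction features intact while the gate decides, token by token, how much knowledge to mix in; this is one plausible reading of "dynamic residual fusion" and is offered only as an illustration.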