Enhancing Scene Understanding for Vision-and-Language Navigation by Knowledge Awareness

Cited: 0
Authors
Gao, Fang [1 ,2 ]
Tang, Jingfeng [1 ]
Wang, Jiabao [1 ]
Li, Shaodong [1 ]
Yu, Jun [3 ]
Affiliations
[1] Guangxi Univ, Sch Elect Engn, Nanning 530004, Peoples R China
[2] Anhui Key Lab Bion Sensing & Adv Robot Technol, Hefei 230031, Peoples R China
[3] Univ Sci & Technol China, Dept Automat, Hefei 230027, Peoples R China
Source
IEEE Robotics and Automation Letters
Keywords
Embodied AI; vision-and-language navigation; natural language generation; knowledge enhancement;
DOI
10.1109/LRA.2024.3483042
CLC number
TP24 [Robotics]
Discipline codes
080202; 1405
Abstract
Vision-and-Language Navigation (VLN) has garnered widespread attention and research interest due to its potential applications in real-world scenarios. Despite significant progress in the VLN field in recent years, limitations persist. Many agents rely solely on the overall features of candidate views and therefore struggle to make accurate decisions when faced with similar views during navigation. This challenge primarily arises from the lack of common-sense knowledge about room layouts. Recognizing that room knowledge can establish relationships between rooms and objects in the environment, we leverage BLIP-2 to construct room layout knowledge described in natural language, covering relationships between rooms and individual objects, relationships between objects, attributes of individual objects (such as color), and room types, thus providing comprehensive room layout information to the agent. We propose a Knowledge-Enhanced Scene Understanding (KESU) model that augments the agent's understanding of the environment with this room layout knowledge. The Instruction Augmentation Module (IA) and the Knowledge History Fusion Module (KHF) in KESU provide room layout knowledge for instruction features and vision-history features, respectively, thereby enhancing the agent's navigation abilities. To integrate knowledge information with instruction features more effectively, we introduce Dynamic Residual Fusion (DRF) in the IA module. Finally, we conduct extensive experiments on the R2R, REVERIE, and SOON datasets, demonstrating the effectiveness of the proposed approach.
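The Dynamic Residual Fusion step described in the abstract can be read as a gated residual mix of knowledge features into instruction features. The sketch below is an illustrative assumption, not the authors' implementation: the class name, the per-token sigmoid gate over concatenated features, and all shapes are hypothetical choices made only to show the general pattern.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class DynamicResidualFusion:
    """Hypothetical sketch of a gated residual fusion: decide, per token,
    how much projected knowledge to add to the instruction feature."""

    def __init__(self, dim, seed=0):
        rng = np.random.default_rng(seed)
        # W_g scores the concatenated [instruction; knowledge] pair per token.
        self.W_g = rng.standard_normal((2 * dim, 1)) / np.sqrt(2 * dim)
        # W_k projects knowledge features into the instruction feature space.
        self.W_k = rng.standard_normal((dim, dim)) / np.sqrt(dim)

    def __call__(self, instr, know):
        # instr, know: (seq_len, dim) token-aligned feature matrices
        gate = sigmoid(np.concatenate([instr, know], axis=-1) @ self.W_g)  # (seq_len, 1)
        # Residual connection: instruction features pass through unchanged,
        # with gated knowledge added on top.
        return instr + gate * (know @ self.W_k)

drf = DynamicResidualFusion(dim=8)
instr = np.ones((4, 8))   # four instruction tokens
know = np.ones((4, 8))    # aligned knowledge features
out = drf(instr, know)
print(out.shape)  # (4, 8)
```

Because the gate is computed per token, the model can lean on knowledge only where the instruction is ambiguous, which matches the motivation of distinguishing similar candidate views.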
Pages: 10874-10881
Page count: 8
Related papers
50 records total
  • [21] Topological Planning with Transformers for Vision-and-Language Navigation
    Chen, Kevin
    Chen, Junshen K.
    Chuang, Jo
    Vazquez, Marynel
    Savarese, Silvio
    2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2021, 2021, : 11271 - 11281
  • [22] Incorporating External Knowledge Reasoning for Vision-and-Language Navigation with Assistant's Help
    Li, Xin
    Zhang, Yu
    Yuan, Weilin
    Luo, Junren
    APPLIED SCIENCES-BASEL, 2022, 12 (14):
  • [23] Scaling Data Generation in Vision-and-Language Navigation
    Wang, Zun
    Li, Jialu
    Hong, Yicong
    Wang, Yi
    Wu, Qi
    Bansal, Mohit
    Gould, Stephen
    Tan, Hao
    Qiao, Yu
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023), 2023, : 11975 - 11986
  • [24] AerialVLN: Vision-and-Language Navigation for UAVs
    Liu, Shubo
    Zhang, Hongsheng
    Qi, Yuankai
    Wang, Peng
    Zhang, Yanning
    Wu, Qi
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023), 2023, : 15338 - 15348
  • [25] Vision-and-Language Navigation via Causal Learning
    Wang, Liuyi
    He, Zongtao
    Dang, Ronghao
    Shen, Mengjiao
    Liu, Chengju
    Chen, Qijun
    2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2024, : 13139 - 13150
  • [26] Reinforced Vision-and-Language Navigation Based on Historical BERT
    Zhang, Zixuan
    Qi, Shuhan
    Zhou, Zihao
    Zhang, Jiajia
    Yuan, Hao
    Wang, Xuan
    Wang, Lei
    Xiao, Jing
    ADVANCES IN SWARM INTELLIGENCE, ICSI 2023, PT II, 2023, 13969 : 427 - 438
  • [27] History Aware Multimodal Transformer for Vision-and-Language Navigation
    Chen, Shizhe
    Guhur, Pierre-Louis
    Schmid, Cordelia
    Laptev, Ivan
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 34 (NEURIPS 2021), 2021, 34
  • [28] Diagnosing Vision-and-Language Navigation: What Really Matters
    Zhu, Wanrong
    Qi, Yuankai
    Narayana, Pradyumna
    Sone, Kazoo
    Basu, Sugato
    Wang, Eric Xin
    Wu, Qi
    Eckstein, Miguel
    Wang, William Yang
    NAACL 2022: THE 2022 CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: HUMAN LANGUAGE TECHNOLOGIES, 2022, : 5981 - 5993
  • [29] Boosting Vision-and-Language Navigation with Direction Guiding and Backtracing
    Chen, Jingwen
    Luo, Jianjie
    Pan, Yingwei
    Li, Yehao
    Yao, Ting
    Chao, Hongyang
    Mei, Tao
    ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS, 2023, 19 (01)
  • [30] Hierarchical Spatial Proximity Reasoning for Vision-and-Language Navigation
    Xu, Ming
    Xie, Zilong
    IEEE ROBOTICS AND AUTOMATION LETTERS, 2024, 9 (12): : 10756 - 10763