Enhancing Scene Understanding for Vision-and-Language Navigation by Knowledge Awareness

Cited by: 0
Authors
Gao, Fang [1 ,2 ]
Tang, Jingfeng [1 ]
Wang, Jiabao [1 ]
Li, Shaodong [1 ]
Yu, Jun [3 ]
Affiliations
[1] Guangxi Univ, Sch Elect Engn, Nanning 530004, Peoples R China
[2] Anhui Key Lab Bion Sensing & Adv Robot Technol, Hefei 230031, Peoples R China
[3] Univ Sci & Technol China, Dept Automat, Hefei 230027, Peoples R China
Source
IEEE Robotics and Automation Letters
Keywords
Embodied AI; vision-and-language navigation; natural language generation; knowledge enhancement;
DOI
10.1109/LRA.2024.3483042
CLC Classification
TP24 [Robotics]
Discipline Codes
080202; 1405
Abstract
Vision-and-Language Navigation (VLN) has garnered widespread attention and research interest due to its potential applications in real-world scenarios. Despite significant progress in the VLN field in recent years, limitations persist: many agents struggle to make accurate decisions when faced with similar candidate views during navigation because they rely solely on the overall features of those views. This challenge primarily arises from a lack of common-sense knowledge about room layouts. Recognizing that room knowledge can establish relationships between rooms and objects in the environment, we construct room layout knowledge described in natural language by leveraging BLIP-2. This knowledge covers relationships between rooms and individual objects, relationships between objects, attributes of individual objects (such as color), and room types, providing the agent with comprehensive room layout information. We propose a Knowledge-Enhanced Scene Understanding (KESU) model that augments the agent's understanding of the environment by leveraging this room layout knowledge. The Instruction Augmentation (IA) module and the Knowledge History Fusion (KHF) module in KESU provide room layout knowledge for instructions and for vision-history features, respectively, thereby enhancing the agent's navigation ability. To integrate knowledge information with instruction features more effectively, we introduce Dynamic Residual Fusion (DRF) in the IA module. Finally, we conduct extensive experiments on the R2R, REVERIE, and SOON datasets, demonstrating the effectiveness of the proposed approach.
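The sketch below is a minimal, illustrative example of how room-layout knowledge of the four kinds named in the abstract (room type, objects, object attributes, object relations) might be generated from candidate views with BLIP-2, using the standard Hugging Face transformers API. The checkpoint Salesforce/blip2-opt-2.7b, the prompt wording, and the post-processing are assumptions for illustration and are not claimed to be the authors' actual pipeline.

```python
# Illustrative sketch only: query BLIP-2 for natural-language room-layout
# knowledge about a single candidate view. Prompts and checkpoint are
# assumptions, not the paper's actual configuration.
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b",
    torch_dtype=torch.float16 if device == "cuda" else torch.float32,
).to(device)

# Hypothetical prompts, one per knowledge category mentioned in the abstract.
PROMPTS = {
    "room_type": "Question: What type of room is shown in this image? Answer:",
    "objects": "Question: Which objects are visible in this room? Answer:",
    "attributes": "Question: What colors are the main objects in this room? Answer:",
    "relations": "Question: How are the objects in this room arranged relative to each other? Answer:",
}

def describe_view(image_path: str) -> dict:
    """Return a natural-language room-layout description for one candidate view."""
    image = Image.open(image_path).convert("RGB")
    knowledge = {}
    for key, prompt in PROMPTS.items():
        inputs = processor(images=image, text=prompt, return_tensors="pt").to(device, model.dtype)
        out_ids = model.generate(**inputs, max_new_tokens=30)
        knowledge[key] = processor.batch_decode(out_ids, skip_special_tokens=True)[0].strip()
    return knowledge

# Example usage (the image path is hypothetical):
# print(describe_view("candidate_view_0.jpg"))
```

In the full system described in the abstract, such natural-language descriptions would then be fused with instruction features by the IA module (via DRF) and with vision-history features by the KHF module.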
Pages: 10874-10881
Number of pages: 8