Vision-and-Language Navigation via Latent Semantic Alignment Learning

Cited by: 1
Authors
Wu, Siying [1]
Fu, Xueyang [2]
Wu, Feng [2]
Zha, Zheng-Jun [2]
Affiliations
[1] Hefei Comprehens Natl Sci Ctr, Inst Artificial Intelligence, Hefei 230039, Peoples R China
[2] Univ Sci & Technol China, Sch Informat Sci & Technol, Hefei 230026, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Vision-and-language navigation; visual-language pre-training; semantic alignment; OBSTACLE AVOIDANCE; MOBILE;
DOI
10.1109/TMM.2024.3358112
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
Vision-and-Language Navigation (VLN) requires an agent to comprehensively understand the given instructions and the immediate visual information obtained from the environment in order to take correct actions and reach the navigation goal. Semantic alignment across modalities is therefore crucial for the agent to understand its own state during navigation. However, the potential of semantic alignment has not been systematically explored in current studies, which limits further improvement of navigation performance. To address this issue, we propose a new Latent Semantic Alignment Learning method to exploit the semantically aligned relationships contained in the environment. Specifically, we introduce three novel pre-training tasks: Trajectory-conditioned Masked Fragment Modeling, Action Prediction of Masked Observation, and Hierarchical Triple Contrastive Learning. The first two tasks are used to reason about cross-modal dependencies, while the third learns semantically consistent representations across modalities. In this way, the Latent Semantic Alignment Learning method establishes a consistent perception of the environment and makes the agent's actions easier to explain. Experiments on common benchmarks verify the effectiveness of the proposed method. For example, we improve the Success Rate by 1.6% on the R2R validation unseen set and 4.3% on the R4R validation unseen set over the baseline model.
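The abstract does not give the form of the contrastive objective, so the following is only a generic illustration of how a contrastive loss can pull matched vision and text embeddings together; it is not the paper's Hierarchical Triple Contrastive Learning. It is a minimal sketch of a symmetric InfoNCE loss in pure Python, assuming non-zero embedding vectors in which `vision[i]` and `text[i]` form a matched pair. The function names, the temperature value, and the toy vectors are all hypothetical.

```python
import math


def l2_normalize(v):
    # Scale a non-zero vector to unit length so dot products are cosine similarities.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce(anchors, positives, temperature=0.1):
    # Mean InfoNCE loss: anchors[i] should score highest against positives[i];
    # every other positives[j] in the batch serves as a negative.
    a = [l2_normalize(v) for v in anchors]
    p = [l2_normalize(v) for v in positives]
    total = 0.0
    for i, ai in enumerate(a):
        logits = [sum(x * y for x, y in zip(ai, pj)) / temperature for pj in p]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(a)


def symmetric_contrastive_loss(vision, text, temperature=0.1):
    # Average the vision-to-text and text-to-vision directions.
    return 0.5 * (info_nce(vision, text, temperature)
                  + info_nce(text, vision, temperature))


# Toy embeddings (hypothetical): two viewpoints and two instruction fragments.
vision = [[1.0, 0.0], [0.0, 1.0]]
text_aligned = [[0.9, 0.1], [0.1, 0.9]]   # text[i] matches vision[i]
text_swapped = [[0.1, 0.9], [0.9, 0.1]]   # pairs deliberately mismatched

aligned_loss = symmetric_contrastive_loss(vision, text_aligned)
swapped_loss = symmetric_contrastive_loss(vision, text_swapped)
print(aligned_loss < swapped_loss)
```

In this toy setting the aligned pairing yields the lower loss, which is the behavior a contrastive alignment objective is meant to reward.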
Pages: 8406-8418
Page count: 13