ADAPT: Vision-Language Navigation with Modality-Aligned Action Prompts

Cited by: 14
Authors
Lin, Bingqian [1 ,2 ]
Zhu, Yi [2 ]
Chen, Zicong [1 ]
Liang, Xiwen [1 ]
Liu, Jianzhuang [2 ]
Liang, Xiaodan [1 ]
Affiliations
[1] Sun Yat Sen Univ, Shenzhen Campus, Shenzhen, Peoples R China
[2] Huawei Noah's Ark Lab, Hong Kong, Peoples R China
Funding
National Natural Science Foundation of China; National Key R&D Program of China;
Keywords
DOI
10.1109/CVPR52688.2022.01496
CLC number
TP18 [Theory of Artificial Intelligence];
Subject classification codes
081104; 0812; 0835; 1405;
Abstract
Vision-Language Navigation (VLN) is a challenging task that requires an embodied agent to perform action-level modality alignment, i.e., make instruction-asked actions sequentially in complex visual environments. Most existing VLN agents learn the instruction-path data directly and cannot sufficiently explore action-level alignment knowledge inside the multi-modal inputs. In this paper, we propose modAlity-aligneD Action PrompTs (ADAPT), which provides the VLN agent with action prompts to enable the explicit learning of action-level modality alignment to pursue successful navigation. Specifically, an action prompt is defined as a modality-aligned pair of an image sub-prompt and a text sub-prompt, where the former is a single-view observation and the latter is a phrase like "walk past the chair". When starting navigation, the instruction-related action prompt set is retrieved from a pre-built action prompt base and passed through a prompt encoder to obtain the prompt feature. Then the prompt feature is concatenated with the original instruction feature and fed to a multi-layer transformer for action prediction. To collect high-quality action prompts into the prompt base, we use the Contrastive Language-Image Pretraining (CLIP) model, which has a powerful cross-modality alignment ability. A modality alignment loss and a sequential consistency loss are further introduced to enhance the alignment of the action prompt and encourage the agent to focus on the related prompts sequentially. Experimental results on both R2R and RxR show the superiority of ADAPT over state-of-the-art methods.
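The following is a minimal, self-contained sketch (not the authors' released code) of the retrieval-and-concatenation step described in the abstract: instruction-related action prompts are picked from a prompt base by text similarity (plain cosine similarity here stands in for the CLIP-based collection and retrieval), fused by a small prompt encoder, and concatenated with the instruction features before they would be fed to the navigation transformer. All module names, dimensions, and the toy data are illustrative assumptions.

    # Hedged sketch of the ADAPT prompt pipeline described in the abstract.
    # PromptEncoder, the feature dimension D, and the toy data are assumptions.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    D = 512  # shared feature dimension (assumed)

    class PromptEncoder(nn.Module):
        """Fuses an image sub-prompt and a text sub-prompt into one prompt token."""
        def __init__(self, dim=D):
            super().__init__()
            self.fuse = nn.Linear(2 * dim, dim)

        def forward(self, img_feat, txt_feat):
            return self.fuse(torch.cat([img_feat, txt_feat], dim=-1))

    def retrieve_action_prompts(instr_feat, base_txt_feats, k=4):
        """Pick the k prompts whose text sub-prompt best matches the instruction
        (cosine similarity stands in for the CLIP-based retrieval in the paper)."""
        sim = F.cosine_similarity(instr_feat.unsqueeze(0), base_txt_feats, dim=-1)
        return sim.topk(k).indices

    # Toy prompt base: N prompts, each with an image and a text sub-prompt feature.
    N = 100
    base_img_feats = torch.randn(N, D)
    base_txt_feats = torch.randn(N, D)

    instr_feats = torch.randn(20, D)        # token-level instruction features
    instr_global = instr_feats.mean(dim=0)  # crude global instruction feature

    idx = retrieve_action_prompts(instr_global, base_txt_feats, k=4)
    prompt_enc = PromptEncoder()
    prompt_feats = prompt_enc(base_img_feats[idx], base_txt_feats[idx])  # (4, D)

    # Concatenate prompt features with the instruction features along the
    # sequence dimension; the result would be fed to the navigation transformer.
    fused_text_input = torch.cat([instr_feats, prompt_feats], dim=0)     # (24, D)
    print(fused_text_input.shape)

In this sketch the modality alignment loss and sequential consistency loss from the abstract are omitted; the point is only to show how retrieved prompt features sit alongside the instruction features as extra input tokens.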
Pages: 15375-15385
Number of pages: 11