Local Slot Attention for Vision-and-Language Navigation

被引:1
|
作者
Zhuang, Yifeng [1 ]
Sun, Qiang [1 ]
Fu, Yanwei [2 ]
Chen, Lifeng [1 ]
Xue, Xiangyang [1 ]
机构
[1] Fudan Univ, Shanghai, Peoples R China
[2] Fudan Univ, Sch Data Sci, Shanghai, Peoples R China
关键词
vision-and-language navigation; slot attention; local attention;
D O I
10.1145/3512527.3531366
中图分类号
TP [自动化技术、计算机技术];
学科分类号
0812 ;
摘要
Vision-and-language navigation (VLN), a frontier study aiming to pave the way for general-purpose robots, has been a hot topic in the computer vision and natural language processing community. The VLN task requires an agent to navigate to a goal location following natural language instructions in unfamiliar environments. Recently, transformer-based models have gained significant improvements on the VLN task. Since the attention mechanism in the transformer architecture can better integrate inter- and intra-modal information of vision and language. However, there exist two problems in current transformer-based models. 1) The models process each view independently without taking the integrity of the objects into account. 2) During the self-attention operation in the visual modality, the views that are spatially distant can be inter-weaved with each other without explicit restriction. This kind of mixing may introduce extra noise instead of useful information. To address these issues, we propose 1) A slot-attention based module to incorporate information from segmentation of the same object. 2) A local attention mask mechanism to limit the visual attention span. The proposed modules can be easily plugged into any VLN architecture and we use the Recurrent VLN-Bert as our base model. Experiments on the R2R dataset show that our model has achieved the state-of-the-art results.
引用
收藏
页码:545 / 553
页数:9
相关论文
共 50 条
  • [1] GeoVLN: Learning Geometry-Enhanced Visual Representation with Slot Attention for Vision-and-Language Navigation
    Huo, Jingyang
    Sun, Qiang
    Jiang, Boyan
    Lin, Haitao
    Fu, Yanwei
    [J]. 2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023, : 23212 - 23221
  • [2] Iterative Vision-and-Language Navigation
    Krantz, Jacob
    Banerjee, Shurjo
    Zhu, Wang
    Corso, Jason
    Anderson, Peter
    Lee, Stefan
    Thomason, Jesse
    [J]. 2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023, : 14921 - 14930
  • [3] Multimodal attention networks for low-level vision-and-language navigation
    Landi, Federico
    Baraldi, Lorenzo
    Cornia, Marcella
    Corsini, Massimiliano
    Cucchiara, Rita
    [J]. COMPUTER VISION AND IMAGE UNDERSTANDING, 2021, 210
  • [4] On the Evaluation of Vision-and-Language Navigation Instructions
    Zhao, Ming
    Anderson, Peter
    Jain, Vihan
    Wang, Su
    Ku, Alexander
    Baldridge, Jason
    Ie, Eugene
    [J]. 16TH CONFERENCE OF THE EUROPEAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (EACL 2021), 2021, : 1302 - 1316
  • [5] Recent Advances in Vision-and-language Navigation
    Sima S.-L.
    Huang Y.
    He K.-J.
    An D.
    Yuan H.
    Wang L.
    [J]. Zidonghua Xuebao/Acta Automatica Sinica, 2023, 49 (01): : 1 - 14
  • [6] Curriculum Learning for Vision-and-Language Navigation
    Zhang, Jiwen
    Wei, Zhongyu
    Fan, Jianqing
    Peng, Jiajie
    [J]. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 34 (NEURIPS 2021), 2021, 34
  • [7] Episodic Transformer for Vision-and-Language Navigation
    Pashevich, Alexander
    Schmid, Cordelia
    Sun, Chen
    [J]. 2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021), 2021, : 15922 - 15932
  • [8] WebVLN: Vision-and-Language Navigation on Websites
    Chen, Qi
    Pitawela, Dileepa
    Zhao, Chongyang
    Zhou, Gengze
    Chen, Hsiang-Ting
    Wu, Qi
    [J]. THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 2, 2024, : 1165 - 1173
  • [9] Improved Speaker and Navigator for Vision-and-Language Navigation
    Wu, Zongkai
    Liu, Zihan
    Wang, Ting
    Wang, Donglin
    [J]. IEEE MULTIMEDIA, 2021, 28 (04) : 55 - 63
  • [10] Memory-Adaptive Vision-and-Language Navigation
    He, Keji
    Jing, Ya
    Huang, Yan
    Lu, Zhihe
    An, Dong
    Wang, Liang
    [J]. PATTERN RECOGNITION, 2024, 153