Local Slot Attention for Vision-and-Language Navigation

Cited by: 1
Authors
Zhuang, Yifeng [1]
Sun, Qiang [1]
Fu, Yanwei [2]
Chen, Lifeng [1]
Xue, Xiangyang [1]
Affiliations
[1] Fudan Univ, Shanghai, Peoples R China
[2] Fudan Univ, Sch Data Sci, Shanghai, Peoples R China
Keywords
vision-and-language navigation; slot attention; local attention
DOI
10.1145/3512527.3531366
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Vision-and-language navigation (VLN), a frontier research area that aims to pave the way for general-purpose robots, has become a hot topic in the computer vision and natural language processing communities. The VLN task requires an agent to navigate to a goal location in unfamiliar environments by following natural language instructions. Recently, transformer-based models have achieved significant improvements on the VLN task, since the attention mechanism in the transformer architecture can better integrate inter- and intra-modal information of vision and language. However, current transformer-based models have two problems: 1) the models process each view independently, without taking the integrity of objects into account; 2) during the self-attention operation in the visual modality, spatially distant views can be interwoven with each other without explicit restriction. This kind of mixing may introduce extra noise instead of useful information. To address these issues, we propose 1) a slot-attention-based module that incorporates information from segments of the same object, and 2) a local attention mask mechanism that limits the visual attention span. The proposed modules can be easily plugged into any VLN architecture, and we use Recurrent VLN-Bert as our base model. Experiments on the R2R dataset show that our model achieves state-of-the-art results.
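The local attention mask idea mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the 12 x 3 panoramic view grid, the `radius` parameter, the row-major view ordering, and the function names `local_attention_mask` and `masked_self_attention` are all assumptions made for illustration, and the attention uses identity projections to stay short.

```python
import numpy as np

def local_attention_mask(n_headings=12, n_elevations=3, radius=1):
    """Boolean (V, V) mask, V = n_headings * n_elevations.

    mask[i, j] is True when view j lies within `radius` grid steps of
    view i. The grid layout and radius are illustrative assumptions.
    """
    idx = np.arange(n_headings * n_elevations)
    # Assumed row-major layout: one row of headings per elevation.
    elev, head = np.divmod(idx, n_headings)
    d_head = np.abs(head[:, None] - head[None, :])
    # Headings wrap around the panorama, so use the circular distance.
    d_head = np.minimum(d_head, n_headings - d_head)
    d_elev = np.abs(elev[:, None] - elev[None, :])
    return (d_head <= radius) & (d_elev <= radius)

def masked_self_attention(x, mask):
    """Single-head scaled dot-product self-attention over view features.

    Uses identity query/key/value projections (a simplification).
    Positions where mask is False receive zero attention weight.
    """
    scores = x @ x.T / np.sqrt(x.shape[-1])
    scores = np.where(mask, scores, -np.inf)
    # Subtract the row maximum for numerical stability; each row keeps
    # at least its own diagonal entry, so the maximum is finite.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ x, weights

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    views = rng.normal(size=(36, 8))   # 36 hypothetical view features
    mask = local_attention_mask()
    out, weights = masked_self_attention(views, mask)
    print(out.shape, int(mask[0].sum()))
```

With these assumed settings, each view attends only to itself and its immediate neighbours in heading and elevation, so spatially distant views cannot mix during visual self-attention.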
Pages: 545-553
Page count: 9