Local Slot Attention for Vision-and-Language Navigation

Cited by: 1
Authors
Zhuang, Yifeng [1]
Sun, Qiang [1]
Fu, Yanwei [2]
Chen, Lifeng [1]
Xue, Xiangyang [1]
Affiliations
[1] Fudan Univ, Shanghai, Peoples R China
[2] Fudan Univ, Sch Data Sci, Shanghai, Peoples R China
Keywords
vision-and-language navigation; slot attention; local attention
DOI
10.1145/3512527.3531366
Chinese Library Classification number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Vision-and-language navigation (VLN), a frontier area of study aiming to pave the way for general-purpose robots, has been a hot topic in the computer vision and natural language processing communities. The VLN task requires an agent to navigate to a goal location in unfamiliar environments by following natural language instructions. Recently, transformer-based models have achieved significant improvements on the VLN task, since the attention mechanism in the transformer architecture can better integrate inter- and intra-modal information from vision and language. However, current transformer-based models have two problems: 1) the models process each view independently, without taking the integrity of objects into account; 2) during the self-attention operation in the visual modality, spatially distant views can be interwoven with each other without explicit restriction, and this kind of mixing may introduce extra noise instead of useful information. To address these issues, we propose 1) a slot-attention-based module to incorporate information from segmentations of the same object, and 2) a local attention mask mechanism to limit the visual attention span. The proposed modules can be easily plugged into any VLN architecture, and we use Recurrent VLN-Bert as our base model. Experiments on the R2R dataset show that our model achieves state-of-the-art results.
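The abstract does not specify how the local attention mask is constructed. As a rough illustration only, and not the authors' implementation, the sketch below assumes a panorama discretized into 12 headings by 3 elevations (36 views), with view index = elevation * 12 + heading, and lets each view attend only to views within a small heading and elevation window. The function names, the window sizes, and the single-head attention without learned projections are all hypothetical simplifications.

```python
import numpy as np

def local_view_mask(n_headings=12, n_elevations=3,
                    heading_window=1, elevation_window=1):
    """Boolean (V, V) mask; True means view i may attend to view j.

    Assumption: views are indexed as elevation * n_headings + heading.
    """
    idx = np.arange(n_headings * n_elevations)
    heading = idx % n_headings
    elevation = idx // n_headings
    # Heading distance wraps around the panorama (circular distance).
    dh = np.abs(heading[:, None] - heading[None, :])
    dh = np.minimum(dh, n_headings - dh)
    # Elevation distance does not wrap.
    de = np.abs(elevation[:, None] - elevation[None, :])
    return (dh <= heading_window) & (de <= elevation_window)

def masked_self_attention(x, mask):
    """Single-head self-attention over view features x of shape (V, d).

    Positions where mask is False receive zero attention weight.
    Learned query/key/value projections are omitted for brevity.
    """
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)
    scores = np.where(mask, scores, -np.inf)
    # Every row keeps its diagonal entry, so the row maximum is finite.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ x, weights

rng = np.random.default_rng(0)
views = rng.normal(size=(36, 8))   # hypothetical view features
mask = local_view_mask()
out, attn = masked_self_attention(views, mask)
```

With these assumed window sizes, a view at the middle elevation attends to 9 views (3 headings by 3 elevations) rather than all 36, which is one way to keep spatially distant views from mixing.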
Pages: 545-553
Number of pages: 9
Related papers
50 in total
  • [41] VLN↻BERT: A Recurrent Vision-and-Language BERT for Navigation
    Hong, Yicong
    Wu, Qi
    Qi, Yuankai
    Rodriguez-Opazo, Cristian
    Gould, Stephen
    [J]. 2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2021, 2021, : 1643 - 1653
  • [42] Survey on the Research Progress and Development Trend of Vision-and-Language Navigation
    Niu K.
    Wang P.
    [J]. Jisuanji Fuzhu Sheji Yu Tuxingxue Xuebao/Journal of Computer-Aided Design and Computer Graphics, 2022, 34 (12): : 1815 - 1827
  • [43] Multimodal Text Style Transfer for Outdoor Vision-and-Language Navigation
    Zhu, Wanrong
    Wang, Xin Eric
    Fu, Tsu-Jui
    Yan, An
    Narayana, Pradyumna
    Sone, Kazoo
    Basu, Sugato
    Wang, William Yang
    [J]. 16TH CONFERENCE OF THE EUROPEAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (EACL 2021), 2021, : 1207 - 1221
  • [44] Vision-and-Language Navigation: A Survey of Tasks, Methods, and Future Directions
    Gu, Jing
    Stefani, Eliana
    Wu, Qi
    Thomason, Jesse
    Wang, Xin Eric
    [J]. PROCEEDINGS OF THE 60TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2022), VOL 1: (LONG PAPERS), 2022, : 7606 - 7623
  • [45] Frequency-enhanced Data Augmentation for Vision-and-Language Navigation
    He, Keji
    Si, Chenyang
    Lu, Zhihe
    Huang, Yan
    Wang, Liang
    Wang, Xinchao
    [J]. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023,
  • [46] Talk2Nav: Long-Range Vision-and-Language Navigation with Dual Attention and Spatial Memory
    Vasudevan, Arun Balajee
    Dai, Dengxin
    Van Gool, Luc
    [J]. INTERNATIONAL JOURNAL OF COMPUTER VISION, 2021, 129 (01) : 246 - 266
  • [47] Vision-and-Language Navigation: Interpreting visually-grounded navigation instructions in real environments
    Anderson, Peter
    Wu, Qi
    Teney, Damien
    Bruce, Jake
    Johnson, Mark
    Sunderhauf, Niko
    Reid, Ian
    Gould, Stephen
    van den Hengel, Anton
    [J]. 2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2018, : 3674 - 3683
  • [48] Talk2Nav: Long-Range Vision-and-Language Navigation with Dual Attention and Spatial Memory
    Arun Balajee Vasudevan
    Dengxin Dai
    Luc Van Gool
    [J]. International Journal of Computer Vision, 2021, 129 : 246 - 266
  • [49] Hierarchical Cross-Modal Agent for Robotics Vision-and-Language Navigation
    Irshad, Muhammad Zubair
    Ma, Chih-Yao
    Kira, Zsolt
    [J]. 2021 IEEE INTERNATIONAL CONFERENCE ON ROBOTICS AND AUTOMATION (ICRA 2021), 2021, : 13238 - 13246
  • [50] Multimodal Transformer with Variable-Length Memory for Vision-and-Language Navigation
    Lin, Chuang
    Jiang, Yi
    Cai, Jianfei
    Qu, Lizhen
    Haffari, Gholamreza
    Yuan, Zehuan
    [J]. COMPUTER VISION, ECCV 2022, PT XXXVI, 2022, 13696 : 380 - 397