Local Slot Attention for Vision-and-Language Navigation

Cited by: 1

Authors:

Zhuang, Yifeng [1]

Sun, Qiang [1]

Fu, Yanwei [2]

Chen, Lifeng [1]

Xue, Xiangyang [1]

Affiliations:

[1] Fudan Univ, Shanghai, Peoples R China

[2] Fudan Univ, Sch Data Sci, Shanghai, Peoples R China

Source:

PROCEEDINGS OF THE 2022 INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL, ICMR 2022 | 2022

Keywords:

vision-and-language navigation; slot attention; local attention

DOI:

10.1145/3512527.3531366

Chinese Library Classification:

TP [Automation Technology, Computer Technology]

Discipline classification code:

0812

Abstract:

Vision-and-language navigation (VLN), a frontier study aiming to pave the way for general-purpose robots, has been a hot topic in the computer vision and natural language processing communities. The VLN task requires an agent to navigate to a goal location by following natural language instructions in unfamiliar environments. Recently, transformer-based models have achieved significant improvements on the VLN task, since the attention mechanism in the transformer architecture can better integrate inter- and intra-modal information of vision and language. However, there are two problems in current transformer-based models: 1) the models process each view independently without taking the integrity of the objects into account; 2) during the self-attention operation in the visual modality, views that are spatially distant can be interwoven with each other without explicit restriction, and this kind of mixing may introduce extra noise instead of useful information. To address these issues, we propose 1) a slot-attention-based module to incorporate information from segmentation of the same object, and 2) a local attention mask mechanism to limit the visual attention span. The proposed modules can be easily plugged into any VLN architecture, and we use the Recurrent VLN-BERT as our base model. Experiments on the R2R dataset show that our model achieves state-of-the-art results.
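
The local attention mask mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the 12 x 3 grid of panoramic views, the row-major view ordering, the `radius` parameter, and the function names `local_attention_mask` and `masked_self_attention` are all assumptions made for illustration. The idea shown is only that each view attends to views within a small spatial neighbourhood, with headings wrapping around the panorama.

```python
import numpy as np

def local_attention_mask(n_headings=12, n_elevations=3, radius=1):
    """Boolean mask where M[i, j] is True if view j lies within `radius`
    grid steps of view i. Grid size and ordering are assumptions."""
    idx = np.arange(n_headings * n_elevations)
    # Assumed row-major layout: all headings of elevation 0, then elevation 1, ...
    elev, head = np.divmod(idx, n_headings)
    dh = np.abs(head[:, None] - head[None, :])
    dh = np.minimum(dh, n_headings - dh)  # headings wrap around the panorama
    de = np.abs(elev[:, None] - elev[None, :])
    return (dh <= radius) & (de <= radius)

def masked_self_attention(x, mask):
    """Single-head self-attention over view features x of shape (n_views, d),
    without learned projections, restricted by a boolean mask."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)
    scores = np.where(mask, scores, -np.inf)       # block distant views
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)                       # exp(-inf) == 0
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ x, weights

mask = local_attention_mask()
features = np.random.default_rng(0).normal(size=(36, 8))  # placeholder features
attended, weights = masked_self_attention(features, mask)
```

Because every view is always allowed to attend to itself, each row of the mask has at least one permitted entry, so the softmax is well defined. A larger `radius` widens the attention span; a radius covering the whole grid recovers unrestricted self-attention.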


Pages: 545-553

Page count: 9
