Memory-Adaptive Vision-and-Language Navigation

Citations: 2

Authors:

He, Keji [1,2]

Jing, Ya [3]

Huang, Yan [1,2]

Lu, Zhihe [4]

An, Dong [1,5]

Wang, Liang [1,2]

Affiliations:

[1] Chinese Acad Sci, Inst Automat, Ctr Res Intelligent Percept & Comp, State Key Lab Multimodal Artificial Intelligence S, Beijing, Peoples R China

[2] Univ Chinese Acad Sci, Sch Artificial Intelligence, Beijing, Peoples R China

[3] ByteDance AI Lab, Beijing, Peoples R China

[4] Natl Univ Singapore, Singapore, Singapore

[5] Univ Chinese Acad Sci, Sch Future Technol, Beijing, Peoples R China

Source:

PATTERN RECOGNITION | 2024 / Vol. 153

Funding:

National Key Research and Development Program of China; National Natural Science Foundation of China;

Keywords:

Vision-and-Language Navigation; Memory bank; History noises; Memory-Adaptive Model;

DOI:

10.1016/j.patcog.2024.110511

Chinese Library Classification number:

TP18 [Artificial intelligence theory];

Subject classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Vision-and-Language Navigation (VLN) requires an agent to navigate 3D environments by following given instructions, where history is critical for decision-making in the dynamic navigation process. In particular, existing methods widely use a memory bank that stores histories and combine it with multimodal representations of the current scene for better decision-making. However, because these methods weight each history with a single scalar, they cannot isolate the informative cues that coexist with detrimental content in each history, and thus inevitably introduce noise into decision-making. To that end, we propose a novel Memory-Adaptive Model (MAM) that dynamically restrains the detrimental content in histories and retains only the content that benefits navigation. Specifically, two key modules, the Visual and Textual Adaptive Modules, are designed to restrain history noise based on scene-related vision and text, respectively. A Reliability Estimator Module is further introduced to refine these adaptation operations. Our experiments on the widely used RxR and R2R datasets show that MAM outperforms its baseline method by 4.0% / 2.5% and 2% / 1% on the validation unseen / test splits, respectively, with respect to the SR metric.
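The limitation described in the abstract, one scalar weight per history, can be illustrated with a small sketch. This is not the authors' MAM implementation: the function names, the sigmoid gate, and the `sharpness` parameter are all hypothetical choices made only to contrast a scalar-weighted memory readout with a feature-wise gated one.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def scalar_weighted_readout(memory, query):
    """One scalar weight per history: every feature of a history is
    scaled by the same amount, helpful or not."""
    weights = softmax([dot(h, query) for h in memory])
    return [sum(w * h[d] for w, h in zip(weights, memory))
            for d in range(len(query))]


def gated_readout(memory, query, sharpness=4.0):
    """Hypothetical feature-wise variant (an assumption, not the paper's
    formulation): each feature of each history gets its own gate in (0, 1),
    here driven by its agreement with the current-scene feature `query`."""
    weights = softmax([dot(h, query) for h in memory])
    gated = [[sigmoid(sharpness * h_d * q_d) * h_d
              for h_d, q_d in zip(h, query)]
             for h in memory]
    return [sum(w * g[d] for w, g in zip(weights, gated))
            for d in range(len(query))]


# A single stored history whose first feature agrees with the current
# scene and whose second feature conflicts with it.
memory = [[1.0, -2.0]]
query = [1.0, 1.0]

scalar_out = scalar_weighted_readout(memory, query)
gated_out = gated_readout(memory, query)
print(scalar_out, gated_out)
```

In this toy case the scalar readout passes the conflicting second feature through unchanged, while the gated readout shrinks it and keeps most of the agreeing first feature. The paper's Visual and Textual Adaptive Modules and its Reliability Estimator Module are learned components, which this sketch does not attempt to reproduce.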

Pages: 13
