Memory-Adaptive Vision-and-Language Navigation

被引:2
|
作者
He, Keji [1 ,2 ]
Jing, Ya [3 ]
Huang, Yan [1 ,2 ]
Lu, Zhihe [4 ]
An, Dong [1 ,5 ]
Wang, Liang [1 ,2 ]
机构
[1] Chinese Acad Sci, Inst Automat, Ctr Res Intelligent Percept & Comp, State Key Lab Multimodal Artificial Intelligence S, Beijing, Peoples R China
[2] Univ Chinese Acad Sci, Sch Artificial Intelligence, Beijing, Peoples R China
[3] ByteDance AI Lab, Beijing, Peoples R China
[4] Natl Univ Singapore, Singapore, Singapore
[5] Univ Chinese Acad Sci, Sch Future Technol, Beijing, Peoples R China
基金
国家重点研发计划; 中国国家自然科学基金;
关键词
Vision-and-Language Navigation; Memory bank; History noises; Memory-Adaptive Model;
D O I
10.1016/j.patcog.2024.110511
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
Vision -and -Language Navigation (VLN) requests an agent to navigate in 3D environments following given instructions, where history is critical for decision -making in dynamic navigation process. Particularly, a memory bank storing histories is widely used in existing methods to incorporate with multimodel representations in current scenes for better decision -making. However, by weighting each history with a simple scalar, those methods cannot purely utilize the informative cues that co -exist with detrimental contents in each history, thereby inevitably introducing noises into decision -making. To that end, we propose a novel Memory -Adaptive Model (MAM) that can dynamically restrain the detrimental contents in histories for retaining contents that benefit navigation only. Specifically, two key modules, Visual and Textual Adaptive Modules, are designed to restrain history noises based on scene -related vision and text, respectively. A Reliability Estimator Module is further introduced to refine above adaptation operations. Our experiments on the widely used RxR and R2R datasets show that MAM outperforms its baseline method by 4.0% / 2.5% and 2% / 1% on the validation unseen/test split, respectively, wrt the SR metric.
引用
收藏
页数:13
相关论文
共 50 条
  • [1] ESceme: Vision-and-Language Navigation with Episodic Scene Memory
    Zheng, Qi
    Liu, Daqing
    Wang, Chaoyue
    Zhang, Jing
    Wang, Dadong
    Tao, Dacheng
    [J]. INTERNATIONAL JOURNAL OF COMPUTER VISION, 2024,
  • [2] GridMM: Grid Memory Map for Vision-and-Language Navigation
    Wang, Zihan
    Li, Xiangyang
    Yang, Jiahao
    Liu, Yeqi
    Jiang, Shuqiang
    [J]. 2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023), 2023, : 15579 - 15590
  • [3] Iterative Vision-and-Language Navigation
    Krantz, Jacob
    Banerjee, Shurjo
    Zhu, Wang
    Corso, Jason
    Anderson, Peter
    Lee, Stefan
    Thomason, Jesse
    [J]. 2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023, : 14921 - 14930
  • [4] On the Evaluation of Vision-and-Language Navigation Instructions
    Zhao, Ming
    Anderson, Peter
    Jain, Vihan
    Wang, Su
    Ku, Alexander
    Baldridge, Jason
    Ie, Eugene
    [J]. 16TH CONFERENCE OF THE EUROPEAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (EACL 2021), 2021, : 1302 - 1316
  • [5] Recent Advances in Vision-and-language Navigation
    Sima S.-L.
    Huang Y.
    He K.-J.
    An D.
    Yuan H.
    Wang L.
    [J]. Zidonghua Xuebao/Acta Automatica Sinica, 2023, 49 (01): : 1 - 14
  • [6] Curriculum Learning for Vision-and-Language Navigation
    Zhang, Jiwen
    Wei, Zhongyu
    Fan, Jianqing
    Peng, Jiajie
    [J]. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 34 (NEURIPS 2021), 2021, 34
  • [7] Episodic Transformer for Vision-and-Language Navigation
    Pashevich, Alexander
    Schmid, Cordelia
    Sun, Chen
    [J]. 2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021), 2021, : 15922 - 15932
  • [8] WebVLN: Vision-and-Language Navigation on Websites
    Chen, Qi
    Pitawela, Dileepa
    Zhao, Chongyang
    Zhou, Gengze
    Chen, Hsiang-Ting
    Wu, Qi
    [J]. THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 2, 2024, : 1165 - 1173
  • [9] Multimodal Transformer with Variable-Length Memory for Vision-and-Language Navigation
    Lin, Chuang
    Jiang, Yi
    Cai, Jianfei
    Qu, Lizhen
    Haffari, Gholamreza
    Yuan, Zehuan
    [J]. COMPUTER VISION, ECCV 2022, PT XXXVI, 2022, 13696 : 380 - 397
  • [10] Local Slot Attention for Vision-and-Language Navigation
    Zhuang, Yifeng
    Sun, Qiang
    Fu, Yanwei
    Chen, Lifeng
    Xue, Xiangyang
    [J]. PROCEEDINGS OF THE 2022 INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL, ICMR 2022, 2022, : 545 - 553