Reinforced Cross-Modal Matching and Self-Supervised Imitation Learning for Vision-Language Navigation

Cited by: 205

Authors
Wang, Xin [1 ]
Huang, Qiuyuan [2 ]
Celikyilmaz, Asli [2 ]
Gao, Jianfeng [2 ]
Shen, Dinghan [3 ]
Wang, Yuan-Fang [1 ]
Wang, William Yang [1 ]
Zhang, Lei [2 ]
Affiliations
[1] Univ Calif Santa Barbara, Santa Barbara, CA 93106 USA
[2] Microsoft Res, Redmond, WA USA
[3] Duke Univ, Durham, NC 27706 USA
DOI
10.1109/CVPR.2019.00679
Chinese Library Classification: TP18 [Artificial Intelligence Theory]
Discipline codes: 081104; 0812; 0835; 1405
Abstract
Vision-language navigation (VLN) is the task of navigating an embodied agent to carry out natural language instructions inside real 3D environments. In this paper, we study how to address three critical challenges for this task: the cross-modal grounding, the ill-posed feedback, and the generalization problems. First, we propose a novel Reinforced Cross-Modal Matching (RCM) approach that enforces cross-modal grounding both locally and globally via reinforcement learning (RL). Particularly, a matching critic is used to provide an intrinsic reward to encourage global matching between instructions and trajectories, and a reasoning navigator is employed to perform cross-modal grounding in the local visual scene. Evaluation on a VLN benchmark dataset shows that our RCM model significantly outperforms previous methods by 10% on SPL and achieves the new state-of-the-art performance. To improve the generalizability of the learned policy, we further introduce a Self-Supervised Imitation Learning (SIL) method to explore unseen environments by imitating its own past good decisions. We demonstrate that SIL can approximate a better and more efficient policy, which tremendously minimizes the success rate performance gap between seen and unseen environments (from 30.7% to 11.7%).
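The abstract describes combining a trajectory-level intrinsic reward from the matching critic with the per-step extrinsic navigation reward during RL training. A minimal sketch of one way such a mixed return could be computed; the function name, discount factor, and weighting coefficient here are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def mixed_return(extrinsic_rewards, intrinsic_reward, gamma=0.95, delta=0.5):
    """Discounted return per step, mixing per-step extrinsic rewards with a
    single trajectory-level intrinsic (instruction-trajectory matching) reward.
    `delta` weights the intrinsic term; both hyperparameters are illustrative.
    """
    T = len(extrinsic_rewards)
    returns = np.zeros(T)
    g = 0.0
    # Accumulate the discounted extrinsic return backwards in time,
    # then add the (undiscounted) trajectory-level intrinsic bonus.
    for t in reversed(range(T)):
        g = extrinsic_rewards[t] + gamma * g
        returns[t] = g + delta * intrinsic_reward
    return returns
```

In this sketch, the intrinsic reward is a scalar score for the whole trajectory (how well it matches the instruction), so it is added uniformly to every step's return rather than discounted.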
Pages: 3622-6631
Related Papers (50 total; first 10 shown)
  • [1] Zhang, Yi; Zhang, Ce; Tang, Yushun; He, Zhihai. Cross-Modal Concept Learning and Inference for Vision-Language Models. Neurocomputing, 2024, 583.
  • [2] Liu, Yaxin; Wu, Jianlong; Qu, Leigang; Gan, Tian; Yin, Jianhua; Nie, Liqiang. Self-Supervised Correlation Learning for Cross-Modal Retrieval. IEEE Transactions on Multimedia, 2023, 25: 2851-2863.
  • [3] Georgakis, Georgios; Schmeckpeper, Karl; Wanchoo, Karan; Dan, Soham; Miltsakaki, Eleni; Roth, Dan; Daniilidis, Kostas. Cross-modal Map Learning for Vision and Language Navigation. CVPR 2022: 15439-15449.
  • [4] Chen, Yi-Syuan; Song, Yun-Zhu; Yeo, Cheng Yu; Liu, Bei; Fu, Jianlong; Shuai, Hong-Han. SINC: Self-Supervised In-Context Learning for Vision-Language Tasks. ICCV 2023: 15384-15396.
  • [5] Khare, Aparna; Parthasarathy, Srinivas; Sundaram, Shiva. Self-Supervised Learning with Cross-Modal Transformers for Emotion Recognition. IEEE SLT 2021: 381-388.
  • [6] Das, Srijan; Ryoo, Michael. Cross-modal Manifold Cutmix for Self-supervised Video Representation Learning. MVA 2023.
  • [7] Alwassel, Humam; Mahajan, Dhruv; Korbar, Bruno; Torresani, Lorenzo; Ghanem, Bernard; Tran, Du. Self-Supervised Learning by Cross-Modal Audio-Video Clustering. NeurIPS 2020, 33.
  • [8] Patel, Yash; Gomez, Lluis; Rusinol, Marcal; Karatzas, Dimosthenis; Jawahar, C. V. Self-Supervised Visual Representations for Cross-Modal Retrieval. ICMR 2019: 182-186.
  • [9] Salvador, Amaia; Gundogdu, Erhan; Bazzani, Loris; Donoser, Michael. Revamping Cross-Modal Recipe Retrieval with Hierarchical Transformers and Self-supervised Learning. CVPR 2021: 15470-15479.
  • [10] Dong, Xiaoyu; Yokoya, Naoto; Wang, Longguang; Uezato, Tatsumi. Learning Mutual Modulation for Self-supervised Cross-Modal Super-Resolution. ECCV 2022, Pt XIX, 13679: 1-18.