Reinforced Cross-Modal Matching and Self-Supervised Imitation Learning for Vision-Language Navigation

Cited by: 205

Authors:

Wang, Xin [1]

Huang, Qiuyuan [2]

Celikyilmaz, Asli [2]

Gao, Jianfeng [2]

Shen, Dinghan [3]

Wang, Yuan-Fang [1]

Wang, William Yang [1]

Zhang, Lei [2]

Affiliations:

[1] Univ Calif Santa Barbara, Santa Barbara, CA 93106 USA

[2] Microsoft Res, Redmond, WA USA

[3] Duke Univ, Durham, NC 27706 USA

Source:

2019 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2019) | 2019

DOI:

10.1109/CVPR.2019.00679

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

Vision-language navigation (VLN) is the task of navigating an embodied agent to carry out natural language instructions inside real 3D environments. In this paper, we study how to address three critical challenges for this task: the cross-modal grounding, the ill-posed feedback, and the generalization problems. First, we propose a novel Reinforced Cross-Modal Matching (RCM) approach that enforces cross-modal grounding both locally and globally via reinforcement learning (RL). Particularly, a matching critic is used to provide an intrinsic reward to encourage global matching between instructions and trajectories, and a reasoning navigator is employed to perform cross-modal grounding in the local visual scene. Evaluation on a VLN benchmark dataset shows that our RCM model significantly outperforms previous methods by 10% on SPL and achieves the new state-of-the-art performance. To improve the generalizability of the learned policy, we further introduce a Self-Supervised Imitation Learning (SIL) method to explore unseen environments by imitating its own past, good decisions. We demonstrate that SIL can approximate a better and more efficient policy, which tremendously minimizes the success rate performance gap between seen and unseen environments (from 30.7% to 11.7%).

Pages: 3622 / 6631

Page count: 3010
