HOP: History-and-Order Aware Pre-training for Vision-and-Language Navigation

Cited by: 24

Authors:

Qiao, Yanyuan [1]

Qi, Yuankai [1]

Hong, Yicong [2]

Yu, Zheng [1]

Wang, Peng [3]

Wu, Qi [1]

Affiliations:

[1] Univ Adelaide, Adelaide, SA, Australia

[2] Australian Natl Univ, Canberra, ACT, Australia

[3] Northwestern Polytech Univ, Xian, Peoples R China

Source:

2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022) | 2022

DOI:

10.1109/CVPR52688.2022.01498

Chinese Library Classification:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

Pre-training has been adopted in a few recent works for Vision-and-Language Navigation (VLN). However, previous pre-training methods for VLN either lack the ability to predict future actions or ignore the trajectory contexts, which are essential for a greedy navigation process. In this work, to promote the learning of spatio-temporal visual-textual correspondence as well as the agent's capability of decision making, we propose a novel history-and-order aware pre-training paradigm (HOP) with VLN-specific objectives that exploit the past observations and support future action prediction. Specifically, in addition to the commonly used Masked Language Modeling (MLM) and Trajectory-Instruction Matching (TIM), we design two proxy tasks to model temporal order information: Trajectory Order Modeling (TOM) and Group Order Modeling (GOM). Moreover, our navigation action prediction is also enhanced by introducing the task of Action Prediction with History (APH), which takes into account the historical visual perceptions. Extensive experimental results on four downstream VLN tasks (R2R, REVERIE, NDH, RxR) demonstrate the effectiveness of our proposed method compared against several state-of-the-art agents.

Pages: 15397-15406

Page count: 10
