Target-Driven Structured Transformer Planner for Vision-Language Navigation

Cited by: 18

Authors:

Zhao, Yusheng [1]

Chen, Jinyu [1]

Gao, Chen [1]

Wang, Wenguan [2]

Yang, Lirong [3]

Ren, Haibing [3]

Xia, Huaxia [3]

Liu, Si [4]

Affiliations:

[1] Beihang Univ, Hangzhou Innovat Inst, Inst Artificial Intelligence, Beijing, Peoples R China

[2] Univ Technol Sydney, ReLER, AAII, Sydney, Australia

[3] Meituan Inc, Beijing, Peoples R China

[4] Beihang Univ, SCSE, Inst Artificial Intelligence, State Key Lab Virtual Real Technol & Syst, Beijing, Peoples R China

Source:

PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022 | 2022

Funding:

National Natural Science Foundation of China;

Keywords:

Vision-language Navigation; Target-driven Planner; Imaginary Scene Tokenization; Structured Transformer;

DOI:

10.1145/3503161.3548281

Chinese Library Classification (CLC) number:

TP39 [Computer applications];

Discipline classification codes:

081203 ; 0835 ;

Abstract:

Vision-language navigation is the task of directing an embodied agent to navigate in 3D scenes with natural language instructions. For the agent, inferring the long-term navigation target from visual-linguistic clues is crucial for reliable path planning, which, however, has rarely been studied before in the literature. In this article, we propose a Target-Driven Structured Transformer Planner (TD-STP) for long-horizon goal-guided and room layout-aware navigation. Specifically, we devise an Imaginary Scene Tokenization mechanism for explicit estimation of the long-term target (even located in unexplored environments). In addition, we design a Structured Transformer Planner which elegantly incorporates the explored room layout into a neural attention architecture for structured and global planning. Experimental results demonstrate that our TD-STP substantially improves previous best methods' success rate by 2% and 5% on the test set of R2R and REVERIE benchmarks, respectively. Our code is available at https://github.com/YushengZhao/TD-STP.

Pages: 4194-4203

Number of pages: 10
