Target-Driven Structured Transformer Planner for Vision-Language Navigation

Cited by: 18
Authors
Zhao, Yusheng [1 ]
Chen, Jinyu [1 ]
Gao, Chen [1 ]
Wang, Wenguan [2 ]
Yang, Lirong [3 ]
Ren, Haibing [3 ]
Xia, Huaxia [3 ]
Liu, Si [4 ]
Affiliations
[1] Beihang Univ, Hangzhou Innovat Inst, Inst Artificial Intelligence, Beijing, Peoples R China
[2] Univ Technol Sydney, ReLER, AAII, Sydney, Australia
[3] Meituan Inc, Beijing, Peoples R China
[4] Beihang Univ, SCSE, Inst Artificial Intelligence, State Key Lab Virtual Real Technol & Syst, Beijing, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Vision-language Navigation; Target-driven Planner; Imaginary Scene Tokenization; Structured Transformer;
DOI
10.1145/3503161.3548281
Chinese Library Classification number
TP39 [Computer applications];
Discipline classification code
081203; 0835;
Abstract
Vision-language navigation is the task of directing an embodied agent to navigate in 3D scenes with natural language instructions. For the agent, inferring the long-term navigation target from visual-linguistic clues is crucial for reliable path planning; however, this has rarely been studied in the literature. In this article, we propose a Target-Driven Structured Transformer Planner (TD-STP) for long-horizon, goal-guided, and room layout-aware navigation. Specifically, we devise an Imaginary Scene Tokenization mechanism for explicit estimation of the long-term target (even when it is located in unexplored environments). In addition, we design a Structured Transformer Planner that incorporates the explored room layout into a neural attention architecture for structured and global planning. Experimental results demonstrate that TD-STP improves on the success rate of the previous best methods by 2% and 5% on the test sets of the R2R and REVERIE benchmarks, respectively. Our code is available at https://github.com/YushengZhao/TD-STP.
Pages: 4194 - 4203
Number of pages: 10