Target-Driven Structured Transformer Planner for Vision-Language Navigation

Cited by: 18
Authors
Zhao, Yusheng [1]
Chen, Jinyu [1]
Gao, Chen [1]
Wang, Wenguan [2]
Yang, Lirong [3]
Ren, Haibing [3]
Xia, Huaxia [3]
Liu, Si [4]
Affiliations
[1] Beihang Univ, Hangzhou Innovat Inst, Inst Artificial Intelligence, Beijing, Peoples R China
[2] Univ Technol Sydney, ReLER, AAII, Sydney, Australia
[3] Meituan Inc, Beijing, Peoples R China
[4] Beihang Univ, SCSE, Inst Artificial Intelligence, State Key Lab Virtual Real Technol & Syst, Beijing, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Vision-language Navigation; Target-driven Planner; Imaginary Scene Tokenization; Structured Transformer
DOI
10.1145/3503161.3548281
Chinese Library Classification number
TP39 [Computer applications]
Discipline classification codes
081203; 0835
Abstract
Vision-language navigation is the task of directing an embodied agent to navigate in 3D scenes with natural language instructions. For the agent, inferring the long-term navigation target from visual-linguistic clues is crucial for reliable path planning, yet this has rarely been studied in the literature. In this article, we propose a Target-Driven Structured Transformer Planner (TD-STP) for long-horizon goal-guided and room layout-aware navigation. Specifically, we devise an Imaginary Scene Tokenization mechanism for explicit estimation of the long-term target (even when it is located in unexplored environments). In addition, we design a Structured Transformer Planner which elegantly incorporates the explored room layout into a neural attention architecture for structured and global planning. Experimental results demonstrate that our TD-STP substantially improves on the previous best methods' success rate, by 2% and 5% on the test sets of the R2R and REVERIE benchmarks, respectively. Our code is available at https://github.com/YushengZhao/TD-STP.
Pages: 4194-4203
Number of pages: 10