Episodic Transformer for Vision-and-Language Navigation

Cited by: 19
Authors: Pashevich, Alexander [1,2]; Schmid, Cordelia [2]; Sun, Chen [2,3]
Affiliations:
[1] INRIA, Le Chesnay, France
[2] Google Research, Mountain View, CA 94043, USA
[3] Brown University, Providence, RI 02912, USA
DOI: 10.1109/ICCV48922.2021.01564
Chinese Library Classification: TP18 [Artificial Intelligence Theory]
Subject classification codes: 081104; 0812; 0835; 1405
Abstract
Interaction and navigation defined by natural language instructions in dynamic environments pose significant challenges for neural agents. This paper focuses on addressing two challenges: handling long sequences of subtasks, and understanding complex human instructions. We propose Episodic Transformer (E.T.), a multimodal transformer that encodes language inputs and the full episode history of visual observations and actions. To improve training, we leverage synthetic instructions as an intermediate representation that decouples understanding the visual appearance of an environment from the variations of natural language instructions. We demonstrate that encoding the history with a transformer is critical for solving compositional tasks, and that pretraining and joint training with synthetic instructions further improve performance. Our approach sets a new state of the art on the challenging ALFRED benchmark, achieving 38.4% and 8.5% task success rates on the seen and unseen test splits.
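To make the abstract's core idea concrete, the following is a minimal PyTorch sketch of an E.T.-style encoder: a single transformer attends jointly over the language instruction and the full episode history of visual frames and past actions, then predicts the next action. The class name, feature dimensions, positional scheme, and readout position are illustrative assumptions for this sketch, not the authors' released implementation; attention masking over future steps is omitted for brevity.

# Minimal sketch of an episode-history multimodal encoder (assumed, simplified).
import torch
import torch.nn as nn

class EpisodicTransformerSketch(nn.Module):
    def __init__(self, vocab_size, num_actions, d_model=768, n_heads=8, n_layers=2):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, d_model)      # instruction tokens
        self.action_emb = nn.Embedding(num_actions, d_model)   # past actions in the episode
        self.visual_proj = nn.Linear(512, d_model)             # per-frame features (e.g. from a CNN)
        self.pos_emb = nn.Embedding(1024, d_model)             # position over the concatenated sequence
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.action_head = nn.Linear(d_model, num_actions)     # next-action logits

    def forward(self, lang_tokens, frame_feats, past_actions):
        # lang_tokens: (B, L) ints, frame_feats: (B, T, 512) floats, past_actions: (B, T) ints
        lang = self.word_emb(lang_tokens)
        vis = self.visual_proj(frame_feats)
        act = self.action_emb(past_actions)
        # Concatenate language and the full episode history into one sequence.
        x = torch.cat([lang, vis, act], dim=1)
        pos = torch.arange(x.size(1), device=x.device).unsqueeze(0)
        h = self.encoder(x + self.pos_emb(pos))
        # Read next-action logits from the latest visual-frame position (a design choice for this sketch).
        return self.action_head(h[:, lang.size(1) + vis.size(1) - 1])

# Example with hypothetical sizes:
# model = EpisodicTransformerSketch(vocab_size=2000, num_actions=12)
# logits = model(torch.randint(0, 2000, (1, 20)), torch.randn(1, 5, 512), torch.randint(0, 12, (1, 5)))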
Pages: 15922-15932 (11 pages)
Related papers (50 records in total)
  • [1] Zheng, Qi; Liu, Daqing; Wang, Chaoyue; Zhang, Jing; Wang, Dadong; Tao, Dacheng. ESceme: Vision-and-Language Navigation with Episodic Scene Memory. International Journal of Computer Vision, 2024.
  • [2] Chen, Shizhe; Guhur, Pierre-Louis; Schmid, Cordelia; Laptev, Ivan. History Aware Multimodal Transformer for Vision-and-Language Navigation. Advances in Neural Information Processing Systems 34 (NeurIPS 2021), 2021.
  • [3] Krantz, Jacob; Banerjee, Shurjo; Zhu, Wang; Corso, Jason; Anderson, Peter; Lee, Stefan; Thomason, Jesse. Iterative Vision-and-Language Navigation. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023, pp. 14921-14930.
  • [4] Lin, Chuang; Jiang, Yi; Cai, Jianfei; Qu, Lizhen; Haffari, Gholamreza; Yuan, Zehuan. Multimodal Transformer with Variable-Length Memory for Vision-and-Language Navigation. Computer Vision, ECCV 2022, Part XXXVI, vol. 13696, 2022, pp. 380-397.
  • [5] Zhao, Ming; Anderson, Peter; Jain, Vihan; Wang, Su; Ku, Alexander; Baldridge, Jason; Ie, Eugene. On the Evaluation of Vision-and-Language Navigation Instructions. 16th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2021), 2021, pp. 1302-1316.
  • [6] Ni, Han; Chen, Jia; Zhu, DaYong; Shi, Dianxi. A Cross-Modal Object-Aware Transformer for Vision-and-Language Navigation. 2022 IEEE 34th International Conference on Tools with Artificial Intelligence (ICTAI), 2022, pp. 976-981.
  • [7] Moudgil, Abhinav; Majumdar, Arjun; Agrawal, Harsh; Lee, Stefan; Batra, Dhruv. SOAT: A Scene- and Object-Aware Transformer for Vision-and-Language Navigation. Advances in Neural Information Processing Systems 34 (NeurIPS 2021), 2021.
  • [8] Sima, Shuang-Lin; Huang, Yan; He, Ke-Ji; An, Dong; Yuan, Hui; Wang, Liang. Recent Advances in Vision-and-Language Navigation. Zidonghua Xuebao/Acta Automatica Sinica, 2023, 49(01), pp. 1-14.
  • [9] Zhang, Jiwen; Wei, Zhongyu; Fan, Jianqing; Peng, Jiajie. Curriculum Learning for Vision-and-Language Navigation. Advances in Neural Information Processing Systems 34 (NeurIPS 2021), 2021.
  • [10] Chen, Qi; Pitawela, Dileepa; Zhao, Chongyang; Zhou, Gengze; Chen, Hsiang-Ting; Wu, Qi. WebVLN: Vision-and-Language Navigation on Websites. Thirty-Eighth AAAI Conference on Artificial Intelligence, Vol. 38, No. 2, 2024, pp. 1165-1173.