A Cross-Modal Object-Aware Transformer for Vision-and-Language Navigation

Authors
Ni, Han [1]
Chen, Jia [1]
Zhu, DaYong [1]
Shi, Dianxi [1]
Affiliations
[1] Natl Univ Def Technol, Univ Elect Sci & Technol China, Changsha, Peoples R China
Keywords
vision-and-language navigation; cross-modal object; transformer;
DOI
10.1109/ICTAI56018.2022.00149
Chinese Library Classification
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Vision-and-language navigation (VLN) combines cross-modal object references and scene descriptions to provide a breadcrumb trail to a goal location. Existing VLN approaches often do not take full advantage of cross-modal object information. This work proposes a transformer network that perceives cross-modal object data, fusing and aligning the visual and linguistic cues for object references to help agents capture object features. In our method, linguistic object processing provides semantic-level contextual information for visual object features. With this design, our model leverages object features to help the agent substantially improve performance on the R2R and R4R benchmarks. Extensive experiments on both benchmarks demonstrate the effectiveness of the proposed model: it achieves absolute improvements of 1.6% in SPL on R2R and 2.1% in CLS on R4R. Our analysis shows that the network performs better on longer, heavily object-referenced navigation instructions, which indicates that our approach is better able to use object features and align them with the references in the instructions.
Pages: 976-981
Page count: 6