A Cross-Modal Object-Aware Transformer for Vision-and-Language Navigation

Authors
Ni, Han [1]
Chen, Jia [1]
Zhu, DaYong [1]
Shi, Dianxi [1]
Affiliations
[1] Natl Univ Def Technol, Univ Elect Sci & Technol China, Changsha, Peoples R China
Keywords
vision-and-language navigation; cross-modal object; transformer
DOI
10.1109/ICTAI56018.2022.00149
Chinese Library Classification
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Vision-and-language navigation (VLN) combines cross-modal object references and scene descriptions to provide a breadcrumb trail to a goal location. Whereas existing VLN approaches often do not take full advantage of cross-modal object information, this work proposes a transformer network with perceptual cross-modal object data that fuses and aligns the two cue features of object references to help agents capture object features. In our method, linguistic object processing provides semantic-level contextual information for visual object features. With this design, our model leverages object features to help the agent substantially improve performance on the R2R and R4R benchmarks. Extensive experiments on R2R and R4R demonstrate the effectiveness of the proposed model, which achieves absolute improvements of 1.6% in SPL on R2R and 2.1% in CLS on R4R. Our analysis shows that the network performs better on longer, heavily object-referenced navigation instructions, which also indicates that our approach is better able to use object features and align them with references in the instructions.
Pages: 976-981
Number of pages: 6