Vision-Language Navigation With Beam-Constrained Global Normalization

Cited by: 4
Authors
Xie, Liang [1,2]
Zhang, Meishan [3]
Li, You [4]
Qin, Wei [1,2]
Yan, Ye [1,2]
Yin, Erwei [1,2]
Affiliations
[1] Acad Mil Sci China, Natl Innovat Inst Def Technol, Beijing 100071, Peoples R China
[2] Tianjin Artificial Intelligence Innovat Ctr TAI, Tianjin 300450, Peoples R China
[3] Harbin Inst Technol Shenzhen, Inst Comp & Intelligence, Shenzhen 518055, Peoples R China
[4] China Astronaut Res & Training Ctr, Natl Key Lab Human Factors Engn, Beijing 100094, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Trajectory; Navigation; Visualization; Task analysis; Training; Natural languages; Decoding; Beam search; global normalization; sequence to sequence; vision-language navigation (VLN);
DOI
10.1109/TNNLS.2022.3183287
CLC number
TP18 [Artificial Intelligence Theory];
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Vision-language navigation (VLN) is a challenging task in which an agent is guided through a realistic environment by natural language instructions. Sequence-to-sequence modeling is one of the most promising architectures for the task, achieving the navigation goal through a sequence of moving actions, and this line of work has led to state-of-the-art performance. Recently, several studies have shown that beam-search decoding during inference can yield promising performance, as it ranks multiple candidate trajectories by scoring each trajectory as a whole. However, the trajectory-level score can be seriously biased during ranking: it is a simple average of the unit scores of the target-sequence actions, and these unit scores may be incomparable across trajectories since they are produced by a local discriminant classifier. To address this problem, we propose a global normalization strategy that rescales the scores at the trajectory level. Concretely, we present two global score functions that rerank all candidates in the output beam, yielding more comparable trajectory scores and greatly alleviating the bias problem. We conduct experiments on the benchmark room-to-room (R2R) dataset of VLN, and the results show that the proposed global method is effective, providing significant performance gains over the corresponding baselines. Our final model achieves competitive performance on the VLN leaderboard.
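The abstract contrasts the standard beam-search score (a length-averaged sum of per-step log probabilities from a local classifier) with a trajectory-level global normalization that reranks the whole beam. The Python sketch below illustrates that contrast; it is a minimal illustration under stated assumptions, not the authors' implementation, and the function names and the softmax-over-beam normalizer are illustrative stand-ins for the paper's two global score functions.

```python
import math

# Hedged sketch, not the paper's code: the names below and the
# softmax-over-beam normalizer are illustrative assumptions.

def local_score(step_logprobs):
    """Plain beam-search score: length-averaged sum of per-step
    log-probabilities from a local (per-step) classifier. Such unit
    scores need not be comparable across different trajectories."""
    return sum(step_logprobs) / len(step_logprobs)

def global_scores(beam):
    """One possible trajectory-level global score function: treat each
    candidate's total log-probability as an unnormalized logit and
    renormalize with a softmax over the whole beam, so the resulting
    scores are directly comparable across candidates."""
    totals = [sum(lps) for lps in beam]
    m = max(totals)                          # subtract max for stability
    exps = [math.exp(t - m) for t in totals]
    z = sum(exps)
    return [e / z for e in exps]

# Rerank a toy beam of three candidate trajectories (per-step log-probs).
beam = [[-0.1, -0.3, -0.2], [-0.05, -0.9], [-0.4, -0.1, -0.1, -0.2]]
print("local:", [round(local_score(t), 3) for t in beam])
scores = global_scores(beam)
print("global:", [round(s, 3) for s in scores])
print("reranked best candidate:", max(range(len(beam)), key=scores.__getitem__))
```

Note how the local averages reward short trajectories with one confident step, while the global scores are normalized over the whole beam before the best candidate is chosen.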
Pages: 1352-1363 (12 pages)