Boosting Efficient Reinforcement Learning for Vision-and-Language Navigation With Open-Sourced LLM

Cited by: 1
Authors
Wang, Jiawei [1]
Wang, Teng [2]
Cai, Wenzhe [2]
Xu, Lele [2]
Sun, Changyin [2,3]
Affiliations
[1] Tongji Univ, Coll Elect & Informat Engn, Shanghai 201804, Peoples R China
[2] Southeast Univ, Sch Automat, Nanjing 210096, Peoples R China
[3] Anhui Univ, Sch Artificial Intelligence, Hefei 230601, Peoples R China
Source
IEEE ROBOTICS AND AUTOMATION LETTERS | 2025, Vol. 10, No. 1
Funding
National Natural Science Foundation of China
Keywords
Navigation; Trajectory; Visualization; Reinforcement learning; Feature extraction; Cognition; Robots; Transformers; Large language models; Vision-and-language navigation (VLN); large language models; reinforcement learning (RL); attention; discriminator
DOI
10.1109/LRA.2024.3511402
Chinese Library Classification
TP24 [Robotics]
Discipline Codes
080202; 1405
Abstract
Vision-and-Language Navigation (VLN) requires an agent to navigate photo-realistic environments by following language instructions. Existing methods typically train agents with imitation learning. However, approaches based on recurrent neural networks generalize poorly, while transformer-based methods are too large for practical deployment. In contrast, reinforcement learning (RL) agents can overcome dataset limitations and learn navigation policies that adapt to environmental changes. Yet without expert trajectories for supervision, agents struggle to learn effective long-term navigation policies from sparse environment rewards. Instruction decomposition lets agents learn value estimation faster, making them more efficient at learning VLN tasks. We propose Decomposing Instructions with Large Language Models for Vision-and-Language Navigation (DILLM-VLN), which decomposes complex navigation instructions into simple, interpretable sub-instructions using a lightweight, open-source LLM and trains RL agents to complete these sub-instructions sequentially. Building on these interpretable sub-instructions, we introduce a cascaded multi-scale attention (CMA) module and a novel multi-modal fusion discriminator (MFD). CMA integrates instruction features at different scales to provide precise textual guidance. MFD combines scene, object, and action information to comprehensively assess the completion of sub-instructions. Experimental results show that DILLM-VLN significantly improves baseline performance, demonstrating its potential for practical applications.
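Illustrative sketch (not from the paper): the abstract's decomposition step could be prototyped roughly as below. The prompt template, the model name (a stand-in for any lightweight, open-source instruction-tuned LLM), and the parsing logic are all assumptions; the paper's actual prompts, LLM, CMA module, and MFD are not reproduced here, and the MFD's role is only indicated by a comment.

```python
# Hypothetical sketch of DILLM-VLN-style instruction decomposition.
# Model name and prompt are assumptions, not the authors' implementation.
from transformers import pipeline

# Any lightweight open-source instruction-tuned LLM could stand in here.
llm = pipeline("text-generation", model="Qwen/Qwen2.5-1.5B-Instruct")

PROMPT = (
    "Decompose the following navigation instruction into a numbered list "
    "of short, atomic sub-instructions, one action per line.\n"
    "Instruction: {instruction}\n"
    "Sub-instructions:\n"
)

def decompose(instruction: str) -> list[str]:
    """Ask the LLM for sub-instructions and parse the numbered list."""
    out = llm(
        PROMPT.format(instruction=instruction),
        max_new_tokens=128,
        do_sample=False,
    )[0]["generated_text"]
    # The pipeline echoes the prompt, so keep only the generated tail.
    body = out.split("Sub-instructions:")[-1]
    subs = []
    for line in body.splitlines():
        line = line.strip().lstrip("0123456789.)- ").strip()
        if line:
            subs.append(line)
    return subs

# The RL agent would then be trained to complete each sub-instruction in
# order, with a discriminator (the paper's MFD, combining scene, object,
# and action cues) judging when a sub-instruction is complete.
subs = decompose("Exit the bedroom, turn left, and stop at the sofa.")
for i, s in enumerate(subs, 1):
    print(i, s)
```

Decomposing once per episode keeps the LLM out of the RL control loop, which is consistent with the abstract's emphasis on a lightweight model and efficient training; how the sub-instructions are scheduled and rewarded is specific to the paper's method.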
Pages: 612-619 (8 pages)
Related Papers (50 total)
  • [1] LLM as Copilot for Coarse-Grained Vision-and-Language Navigation
    Qiao, Yanyuan
    Liu, Qianyi
    Liu, Jiajun
    Liu, Jing
    Wu, Qi
    COMPUTER VISION - ECCV 2024, PT V, 2025, 15063 : 459 - 476
  • [2] Curriculum Learning for Vision-and-Language Navigation
    Zhang, Jiwen
    Wei, Zhongyu
    Fan, Jianqing
    Peng, Jiajie
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 34 (NEURIPS 2021), 2021, 34
  • [3] Boosting Vision-and-Language Navigation with Direction Guiding and Backtracing
    Chen, Jingwen
    Luo, Jianjie
    Pan, Yingwei
    Li, Yehao
    Yao, Ting
    Chao, Hongyang
    Mei, Tao
    ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS, 2023, 19 (01)
  • [4] Discovering Intrinsic Subgoals for Vision-and-Language Navigation via Hierarchical Reinforcement Learning
    Wang, Jiawei
    Wang, Teng
    Xu, Lele
    He, Zichen
    Sun, Changyin
IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2024: 1 - 13
  • [5] Transferable Representation Learning in Vision-and-Language Navigation
    Huang, Haoshuo
    Jain, Vihan
    Mehta, Harsh
    Ku, Alexander
    Magalhaes, Gabriel
    Baldridge, Jason
    Ie, Eugene
    2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2019), 2019, : 7403 - 7412
  • [6] Vision-and-Language Navigation via Causal Learning
    Wang, Liuyi
    He, Zongtao
    Dang, Ronghao
    Shen, Mengjiao
    Liu, Chengju
    Chen, Qijun
    2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2024, : 13139 - 13150
  • [7] Iterative Vision-and-Language Navigation
    Krantz, Jacob
    Banerjee, Shurjo
    Zhu, Wang
    Corso, Jason
    Anderson, Peter
    Lee, Stefan
    Thomason, Jesse
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023, : 14921 - 14930
  • [8] Learning Vision-and-Language Navigation from YouTube Videos
    Lin, Kunyang
    Chen, Peihao
    Huang, Diwei
    Li, Thomas H.
    Tan, Mingkui
    Gan, Chuang
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023), 2023, : 8283 - 8292
  • [9] VLN-PETL: Parameter-Efficient Transfer Learning for Vision-and-Language Navigation
    Qiao, Yanyuan
    Yu, Zheng
    Wu, Qi
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023), 2023, : 15397 - 15406
  • [10] Cluster-based Curriculum Learning for Vision-and-Language Navigation
    Wang, Ting
    Wu, Zongkai
    Liu, Zihan
    Wang, Donglin
2022 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2022