Boosting Efficient Reinforcement Learning for Vision-and-Language Navigation With Open-Sourced LLM

Cited by: 1
Authors
Wang, Jiawei [1 ]
Wang, Teng [2 ]
Cai, Wenzhe [2 ]
Xu, Lele [2 ]
Sun, Changyin [2 ,3 ]
Affiliations
[1] Tongji Univ, Coll Elect & Informat Engn, Shanghai 201804, Peoples R China
[2] Southeast Univ, Sch Automat, Nanjing 210096, Peoples R China
[3] Anhui Univ, Sch Artificial Intelligence, Hefei 230601, Peoples R China
Source
IEEE ROBOTICS AND AUTOMATION LETTERS | 2025, Vol. 10, No. 1
Funding
National Natural Science Foundation of China;
Keywords
Navigation; Trajectory; Visualization; Reinforcement learning; Feature extraction; Cognition; Robots; Transformers; Large language models; Vision-and-language navigation (VLN); reinforcement learning (RL); attention; discriminator;
DOI
10.1109/LRA.2024.3511402
CLC Number
TP24 [Robotics];
Discipline Code
080202 ; 1405 ;
Abstract
Vision-and-Language Navigation (VLN) requires an agent to navigate photo-realistic environments by following language instructions. Existing methods typically train agents with imitation learning. However, approaches based on recurrent neural networks generalize poorly, while transformer-based methods are too large for practical deployment. In contrast, reinforcement learning (RL) agents can overcome dataset limitations and learn navigation policies that adapt to environment changes. Yet without expert trajectories for supervision, agents struggle to learn effective long-term navigation policies from sparse environment rewards. Instruction decomposition lets agents learn value estimation faster, making them more efficient at learning VLN tasks. We propose the Decomposing Instructions with Large Language Models for Vision-and-Language Navigation (DILLM-VLN) method, which decomposes complex navigation instructions into simple, interpretable sub-instructions using a lightweight, open-sourced LLM and trains RL agents to complete these sub-instructions sequentially. Building on these interpretable sub-instructions, we introduce a cascaded multi-scale attention (CMA) module and a novel multi-modal fusion discriminator (MFD). CMA integrates instruction features at different scales to provide precise textual guidance. MFD combines scene, object, and action information to comprehensively assess the completion of sub-instructions. Experimental results show that DILLM-VLN significantly improves baseline performance, demonstrating its potential for practical applications.
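For illustration only, the Python sketch below outlines the two-stage idea summarized in the abstract: a lightweight open-source LLM first splits a long navigation instruction into ordered sub-instructions, and an RL policy then executes them one at a time while a discriminator decides when each sub-instruction is complete. All names and interfaces here (decompose_instruction, navigate, env, policy, discriminator, toy_llm) are assumptions made for this sketch; they do not reflect the authors' actual implementation of DILLM-VLN, CMA, or MFD.

from typing import Callable, List

PROMPT = (
    "Split this navigation instruction into short, ordered sub-instructions, "
    "one per line:\n{instruction}\n"
)

def decompose_instruction(llm: Callable[[str], str], instruction: str) -> List[str]:
    """Query the LLM once and parse one sub-instruction per output line."""
    raw = llm(PROMPT.format(instruction=instruction))
    return [line.lstrip("-* ").strip() for line in raw.splitlines() if line.strip()]

def navigate(env, policy, discriminator, sub_instructions: List[str], max_steps: int = 50):
    """Condition the RL policy on one sub-instruction at a time; a discriminator
    (in the paper, the MFD fusing scene, object, and action cues) signals when to advance."""
    obs = env.reset()
    for sub in sub_instructions:
        for _ in range(max_steps):
            action = policy(obs, sub)              # policy sees only the current sub-instruction
            obs, episode_done = env.step(action)
            if discriminator(obs, sub, action):    # current sub-instruction judged complete
                break
            if episode_done:                       # episode ended before all sub-goals finished
                return obs
    return obs

if __name__ == "__main__":
    # Toy stand-in for a lightweight open-source LLM: returns a fixed decomposition.
    toy_llm = lambda prompt: "Walk out of the bedroom\nTurn left into the hallway\nStop next to the sofa"
    print(decompose_instruction(toy_llm,
          "Leave the bedroom, turn left, and wait next to the sofa."))

The sequential loop reflects the motivation stated in the abstract: decomposition shortens the horizon over which the agent must learn value estimates from sparse rewards, since each sub-instruction defines a nearer-term goal.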
Pages: 612 - 619
Page count: 8