Boosting Efficient Reinforcement Learning for Vision-and-Language Navigation With Open-Sourced LLM

Times Cited: 1
Authors
Wang, Jiawei [1 ]
Wang, Teng [2 ]
Cai, Wenzhe [2 ]
Xu, Lele [2 ]
Sun, Changyin [2 ,3 ]
Affiliations
[1] Tongji Univ, Coll Elect & Informat Engn, Shanghai 201804, Peoples R China
[2] Southeast Univ, Sch Automat, Nanjing 210096, Peoples R China
[3] Anhui Univ, Sch Artificial Intelligence, Hefei 230601, Peoples R China
Source
IEEE ROBOTICS AND AUTOMATION LETTERS | 2025, Vol. 10, No. 1
Funding
National Natural Science Foundation of China;
Keywords
Navigation; Trajectory; Visualization; Reinforcement learning; Feature extraction; Cognition; Robots; Transformers; Large language models; Vision-and-language navigation (VLN); reinforcement learning (RL); attention; discriminator;
DOI
10.1109/LRA.2024.3511402
CLC Number
TP24 [Robotics];
Discipline Codes
080202; 1405;
Abstract
Vision-and-Language Navigation (VLN) requires an agent to navigate photo-realistic environments by following language instructions. Existing methods typically train agents with imitation learning. However, approaches based on recurrent neural networks generalize poorly, while transformer-based methods are too large for practical deployment. In contrast, reinforcement learning (RL) agents can overcome dataset limitations and learn navigation policies that adapt to environment changes. Without expert trajectories for supervision, however, agents struggle to learn effective long-term navigation policies from sparse environment rewards. Instruction decomposition lets agents learn value estimation faster, making them more efficient at learning VLN tasks. We propose Decomposing Instructions with Large Language Models for Vision-and-Language Navigation (DILLM-VLN), which decomposes complex navigation instructions into simple, interpretable sub-instructions using a lightweight, open-source LLM and trains RL agents to complete these sub-instructions sequentially. Building on these interpretable sub-instructions, we introduce a cascaded multi-scale attention (CMA) module and a novel multi-modal fusion discriminator (MFD). CMA integrates instruction features at different scales to provide precise textual guidance. MFD combines scene, object, and action information to comprehensively assess the completion of each sub-instruction. Experimental results show that DILLM-VLN significantly improves baseline performance, demonstrating its potential for practical applications.
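The decomposition step described in the abstract can be illustrated with a short, self-contained sketch. This is a minimal illustration of the idea, not the authors' pipeline: the model name (Qwen/Qwen2.5-1.5B-Instruct), the prompt wording, and the numbered-list parsing are all assumptions standing in for the unspecified "lightweight, open-sourced LLM".

```python
# Minimal sketch of LLM-based instruction decomposition for VLN.
# Assumptions (not from the paper): the model choice, prompt wording,
# and numbered-list parsing below are illustrative placeholders.
from transformers import pipeline

# Any lightweight open-source instruction-tuned LLM could stand in here.
generator = pipeline("text-generation", model="Qwen/Qwen2.5-1.5B-Instruct")

def decompose(instruction: str) -> list[str]:
    """Ask the LLM to split a long VLN instruction into ordered sub-instructions."""
    prompt = (
        "Decompose the following navigation instruction into short, "
        "numbered sub-instructions, one action per line.\n"
        f"Instruction: {instruction}\n"
        "Sub-instructions:\n"
    )
    out = generator(prompt, max_new_tokens=128, do_sample=False)[0]["generated_text"]
    completion = out[len(prompt):]  # the pipeline echoes the prompt; keep only new text
    # Keep lines of the form "1. Walk down the hallway."
    return [
        line.split(".", 1)[1].strip()
        for line in completion.splitlines()
        if line.strip()[:1].isdigit() and "." in line
    ]

subs = decompose("Leave the bedroom, walk down the hall, and stop next to the sofa.")
# Per the abstract, an RL agent is then trained to complete subs[0], subs[1], ...
# in order, with the multi-modal fusion discriminator (MFD) judging when each
# sub-instruction has been completed.
```

Note that in the method described above it is the MFD, not the LLM, that decides when a sub-instruction is finished; the sketch covers only the offline decomposition step.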
Pages: 612-619
Page count: 8
Related Papers
50 records in total
  • [21] Local Slot Attention for Vision-and-Language Navigation
    Zhuang, Yifeng
    Sun, Qiang
    Fu, Yanwei
    Chen, Lifeng
    Xue, Xiangyang
    PROCEEDINGS OF THE 2022 INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL, ICMR 2022, 2022, : 545 - 553
  • [22] Improved Speaker and Navigator for Vision-and-Language Navigation
    Wu, Zongkai
    Liu, Zihan
    Wang, Ting
    Wang, Donglin
    IEEE MULTIMEDIA, 2021, 28 (04) : 55 - 63
  • [23] ENVEDIT: Environment Editing for Vision-and-Language Navigation
    Li, Jialu
    Tan, Hao
    Bansal, Mohit
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022, : 15386 - 15396
  • [24] Diagnosing the Environment Bias in Vision-and-Language Navigation
    Zhang, Yubo
    Tan, Hao
    Bansal, Mohit
    PROCEEDINGS OF THE TWENTY-NINTH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2020, : 890 - 897
  • [25] Topological Planning with Transformers for Vision-and-Language Navigation
    Chen, Kevin
    Chen, Junshen K.
    Chuang, Jo
    Vazquez, Marynel
    Savarese, Silvio
    2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2021, 2021, : 11271 - 11281
  • [26] Scaling Data Generation in Vision-and-Language Navigation
    Wang, Zun
    Li, Jialu
    Hong, Yicong
    Wang, Yi
    Wu, Qi
    Bansal, Mohit
    Gould, Stephen
    Tan, Hao
    Qiao, Yu
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023), 2023, : 11975 - 11986
  • [27] AerialVLN: Vision-and-Language Navigation for UAVs
    Liu, Shubo
    Zhang, Hongsheng
    Qi, Yuankai
    Wang, Peng
    Zhang, Yanning
    Wu, Qi
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023), 2023, : 15338 - 15348
  • [28] Learning from Unlabeled 3D Environments for Vision-and-Language Navigation
    Chen, Shizhe
    Guhur, Pierre-Louis
    Tapaswi, Makarand
    Schmid, Cordelia
    Laptev, Ivan
    COMPUTER VISION, ECCV 2022, PT XXXIX, 2022, 13699 : 638 - 655
  • [29] Visual Perception Generalization for Vision-and-Language Navigation via Meta-Learning
    Wang, Ting
    Wu, Zongkai
    Wang, Donglin
    IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2023, 34 (08) : 5193 - 5199
  • [30] A New Path: Scaling Vision-and-Language Navigation with Synthetic Instructions and Imitation Learning
    Kamath, Aishwarya
    Anderson, Peter
    Wang, Su
    Koh, Jing Yu
    Ku, Alexander
    Waters, Austin
    Yang, Yinfei
    Baldridge, Jason
    Parekh, Zarana
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023, : 10813 - 10823