Boosting Efficient Reinforcement Learning for Vision-and-Language Navigation With Open-Sourced LLM

Cited by: 1
|
Authors
Wang, Jiawei [1 ]
Wang, Teng [2 ]
Cai, Wenzhe [2 ]
Xu, Lele [2 ]
Sun, Changyin [2 ,3 ]
Affiliations
[1] Tongji Univ, Coll Elect & Informat Engn, Shanghai 201804, Peoples R China
[2] Southeast Univ, Sch Automat, Nanjing 210096, Peoples R China
[3] Anhui Univ, Sch Artificial Intelligence, Hefei 230601, Peoples R China
Source
IEEE ROBOTICS AND AUTOMATION LETTERS | 2025, Vol. 10, No. 1
Funding
National Natural Science Foundation of China;
Keywords
Navigation; Trajectory; Visualization; Reinforcement learning; Feature extraction; Cognition; Robots; Transformers; Sun; Large language models; Vision-and-language navigation (VLN); large language models; reinforcement learning (RL); attention; discriminator;
DOI
10.1109/LRA.2024.3511402
Chinese Library Classification
TP24 [Robotics];
Discipline Classification Code
080202 ; 1405 ;
Abstract
Vision-and-Language Navigation (VLN) requires an agent to navigate photo-realistic environments based on language instructions. Existing methods typically train agents with imitation learning. However, approaches based on recurrent neural networks generalize poorly, while transformer-based methods are too large for practical deployment. In contrast, reinforcement learning (RL) agents can overcome dataset limitations and learn navigation policies that adapt to environment changes. Without expert trajectories for supervision, however, agents struggle to learn effective long-term navigation policies from sparse environment rewards. Instruction decomposition enables agents to learn value estimation faster, making them more efficient at learning VLN tasks. We propose Decomposing Instructions with Large Language Models for Vision-and-Language Navigation (DILLM-VLN), which decomposes complex navigation instructions into simple, interpretable sub-instructions using a lightweight, open-sourced LLM and trains RL agents to complete these sub-instructions sequentially. Building on these interpretable sub-instructions, we introduce a cascaded multi-scale attention (CMA) module and a novel multi-modal fusion discriminator (MFD). CMA integrates instruction features at different scales to provide precise textual guidance. MFD combines scene, object, and action information to comprehensively assess sub-instruction completion. Experimental results show that DILLM-VLN significantly improves over the baseline, demonstrating its potential for practical applications.
Pages: 612-619
Page count: 8
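To make the abstract's pipeline concrete, here is a minimal, illustrative Python sketch (not the authors' code) of the DILLM-VLN idea: an open-sourced LLM decomposes a long instruction into ordered sub-instructions, an RL agent pursues them one at a time, and a discriminator decides when each sub-instruction is complete, yielding dense rewards. All names here (decompose_instruction, is_subgoal_done, run_episode) are hypothetical placeholders, and the LLM and MFD calls are stubbed out.

```python
# Illustrative sketch of instruction decomposition + sequential sub-goal rewards.
# The real method uses a lightweight open-source LLM for decomposition and a
# multi-modal fusion discriminator (MFD) for completion checks; both are mocked.
import random
from typing import List


def decompose_instruction(instruction: str) -> List[str]:
    """Stand-in for prompting an LLM to split a long navigation instruction
    into short, ordered sub-instructions."""
    parts = [p.strip() for p in instruction.replace(", then", ".").split(".")]
    return [p for p in parts if p]


def is_subgoal_done(observation: dict, sub_instruction: str) -> bool:
    """Stand-in for the MFD, which would fuse scene, object, and action
    features to judge whether the current sub-instruction is completed."""
    return random.random() < 0.2  # placeholder decision


def run_episode(instruction: str, max_steps: int = 50) -> float:
    sub_instructions = decompose_instruction(instruction)
    total_reward, goal_idx = 0.0, 0
    observation = {"rgb": None, "objects": [], "last_action": None}
    for _ in range(max_steps):
        if goal_idx >= len(sub_instructions):
            break  # all sub-instructions completed
        # An RL policy would condition on the current sub-instruction here.
        observation["last_action"] = random.choice(["forward", "left", "right"])
        if is_subgoal_done(observation, sub_instructions[goal_idx]):
            total_reward += 1.0  # dense reward for finishing a sub-instruction
            goal_idx += 1
    return total_reward


if __name__ == "__main__":
    print(run_episode("Walk past the sofa, then turn left. Stop at the door."))
```

The point of the sketch is the reward structure: instead of a single sparse reward at the goal, the agent is rewarded after each sub-instruction, which is what makes value estimation easier to learn.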