Accelerating Self-Imitation Learning from Demonstrations via Policy Constraints and Q-Ensemble

Cited by: 0
Authors
Li, Chao [1 ]
Wu, Fengge [1 ]
Zhao, Junsuo [1 ]
Affiliations
[1] Institute of Software, Chinese Academy of Sciences; University of Chinese Academy of Sciences, Beijing, China
Keywords
Deep Reinforcement Learning; Learning from Demonstrations; Self-Imitation Learning; Sample Efficiency
DOI
10.1109/IJCNN54540.2023.10191691
CLC classification
TP18 (Artificial Intelligence Theory)
Discipline codes
081104; 0812; 0835; 1405
Abstract
Deep reinforcement learning (DRL) offers a new way to generate robot control policies. However, training a control policy requires lengthy exploration, so reinforcement learning (RL) suffers from low sample efficiency in real-world tasks. Both imitation learning (IL) and learning from demonstrations (LfD) speed up training by using expert demonstrations, but imperfect demonstrations can mislead policy improvement. Offline-to-online reinforcement learning requires a large amount of offline data to initialize the policy, and distribution shift easily degrades performance during online fine-tuning. To address these problems, we propose a learning-from-demonstrations method named Accelerating Self-Imitation Learning from Demonstrations (A-SILfD), which treats expert demonstrations as the agent's own successful experiences and uses these experiences to constrain policy improvement. Furthermore, we use an ensemble of Q-functions to prevent the performance degradation caused by large estimation errors in a single Q-function. Our experiments show that A-SILfD significantly improves sample efficiency using a small number of expert demonstrations of varying quality. On four MuJoCo continuous-control tasks, A-SILfD significantly outperforms baseline methods after 150,000 steps of online training and is not misled by imperfect demonstrations. Ablation experiments further demonstrate the effectiveness of each component of the method.
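The abstract names two mechanisms: a self-imitation buffer of successful experiences that constrains policy updates, and an ensemble of Q-functions whose conservative aggregate suppresses estimation error. The sketch below (plain PyTorch, not the authors' code) illustrates only the second mechanism under stated assumptions: the class names QNetwork and QEnsemble, the ensemble size of five, and the minimum-over-members aggregation are illustrative choices, since the paper's exact architecture is not given in this record.

import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """One ensemble member: a small MLP mapping (state, action) to a scalar Q-value."""
    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1))

class QEnsemble(nn.Module):
    """Several independently initialized Q-networks; a pessimistic aggregate
    of their outputs damps the per-network estimation errors that can
    destabilize online fine-tuning."""
    def __init__(self, obs_dim, act_dim, n_members=5):
        super().__init__()
        self.members = nn.ModuleList(
            [QNetwork(obs_dim, act_dim) for _ in range(n_members)]
        )

    def forward(self, obs, act):
        # Shape (n_members, batch, 1): one estimate per ensemble member.
        return torch.stack([q(obs, act) for q in self.members])

    def conservative(self, obs, act):
        # Minimum over members, usable when forming TD targets, e.g.
        # target = r + gamma * (1 - done) * min_i Q_i(s', pi(s')).
        return self.forward(obs, act).min(dim=0).values

# Illustrative usage with random data (dimensions are arbitrary):
ensemble = QEnsemble(obs_dim=11, act_dim=3)
obs, act = torch.randn(32, 11), torch.randn(32, 3)
with torch.no_grad():
    q_min = ensemble.conservative(obs, act)   # shape: (32, 1)

Taking the minimum over the ensemble is one common conservative choice (as in clipped double Q-learning); the paper may use a different aggregate, so treat this as a placeholder for whatever pessimistic estimate A-SILfD actually employs.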
Pages: 8