Sample-Efficient Model-Free Reinforcement Learning with Off-Policy Critics

Citations: 0
Authors
Steckelmacher, Denis [1]
Plisnier, Helene [1]
Roijers, Diederik M. [2]
Nowe, Ann [1]
Affiliations
[1] Vrije Univ Brussel, Pleinlaan 2, B-1050 Brussels, Belgium
[2] Vrije Univ Amsterdam, De Boelelaan 1105, NL-1081 HV Amsterdam, Netherlands
Keywords
Reinforcement learning; Value iteration; Actor-critic; Algorithms
DOI
10.1007/978-3-030-46133-1_2
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Value-based reinforcement-learning algorithms provide state-of-the-art results in model-free discrete-action settings, and tend to outperform actor-critic algorithms. We argue that actor-critic algorithms are limited by their need for an on-policy critic. We propose Bootstrapped Dual Policy Iteration (BDPI), a novel model-free reinforcement-learning algorithm for continuous states and discrete actions, with an actor and several off-policy critics. Off-policy critics are compatible with experience replay, ensuring high sample-efficiency, without the need for off-policy corrections. The actor, by slowly imitating the average greedy policy of the critics, leads to high-quality and state-specific exploration, which we compare to Thompson sampling. Because the actor and critics are fully decoupled, BDPI is remarkably stable, and unusually robust to its hyper-parameters. BDPI is significantly more sample-efficient than Bootstrapped DQN, PPO, and ACKTR, on discrete, continuous and pixel-based tasks. Source code: https://github.com/vub-ailab/bdpi. Appendix: https://arxiv.org/abs/1903.04193.
Pages: 19-34
Page count: 16
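The abstract describes BDPI's two components: several off-policy critics trained with experience replay, and an actor that slowly imitates their average greedy policy. Below is a minimal tabular sketch of that mechanism in Python/NumPy; the problem size, hyper-parameter values and helper names (act, store, train_step) are illustrative assumptions, and this is not the authors' implementation (see the repository linked in the abstract for that).

```python
# Minimal tabular sketch of the mechanism described in the abstract: several
# off-policy critics trained by Q-Learning from a shared experience replay
# buffer, and an actor that slowly imitates their average greedy policy.
# All names, sizes and hyper-parameter values are illustrative assumptions.
import numpy as np

n_states, n_actions, n_critics = 16, 4, 8     # assumed problem size
alpha, gamma, actor_lr = 0.2, 0.99, 0.05      # assumed learning rates / discount

rng = np.random.default_rng(0)
critics = [np.zeros((n_states, n_actions)) for _ in range(n_critics)]
actor = np.full((n_states, n_actions), 1.0 / n_actions)   # stochastic policy
replay = []                                                # shared replay buffer

def act(state):
    # Sample from the actor's probabilities: state-specific exploration.
    return rng.choice(n_actions, p=actor[state])

def store(s, a, r, s_next, done):
    replay.append((s, a, r, s_next, done))

def train_step(batch_size=32):
    if len(replay) < batch_size:
        return
    batch = [replay[i] for i in rng.integers(len(replay), size=batch_size)]

    # Off-policy critics: plain Q-Learning on replayed transitions, so no
    # importance-sampling or other off-policy corrections are needed.
    for Q in critics:
        for s, a, r, s_next, done in batch:
            target = r + (0.0 if done else gamma * Q[s_next].max())
            Q[s, a] += alpha * (target - Q[s, a])

    # Actor: move slowly towards the average greedy policy of the critics.
    greedy = np.zeros((n_states, n_actions))
    for Q in critics:
        greedy[np.arange(n_states), Q.argmax(axis=1)] += 1.0 / n_critics
    actor[:] = (1.0 - actor_lr) * actor + actor_lr * greedy
```

A training loop would interleave act(), store() and train_step() while interacting with an environment; the full algorithm in the paper uses function approximation to handle continuous states and pixel-based tasks, which this tabular sketch omits.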