Sample-Efficient Model-Free Reinforcement Learning with Off-Policy Critics

Citations: 0
Authors
Steckelmacher, Denis [1]
Plisnier, Helene [1]
Roijers, Diederik M. [2]
Nowe, Ann [1]
Affiliations
[1] Vrije Univ Brussel, Pleinlaan 2, B-1050 Brussels, Belgium
[2] Vrije Univ Amsterdam, De Boelelaan 1105, NL-1081 HV Amsterdam, Netherlands
Keywords
Reinforcement learning; Value iteration; Actor-critic; Algorithms
DOI
10.1007/978-3-030-46133-1_2
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Value-based reinforcement-learning algorithms provide state-of-the-art results in model-free discrete-action settings, and tend to outperform actor-critic algorithms. We argue that actor-critic algorithms are limited by their need for an on-policy critic. We propose Bootstrapped Dual Policy Iteration (BDPI), a novel model-free reinforcement-learning algorithm for continuous states and discrete actions, with an actor and several off-policy critics. Off-policy critics are compatible with experience replay, ensuring high sample-efficiency, without the need for off-policy corrections. The actor, by slowly imitating the average greedy policy of the critics, leads to high-quality and state-specific exploration, which we compare to Thompson sampling. Because the actor and critics are fully decoupled, BDPI is remarkably stable, and unusually robust to its hyper-parameters. BDPI is significantly more sample-efficient than Bootstrapped DQN, PPO, and ACKTR, on discrete, continuous and pixel-based tasks. Source code: https://github.com/vub-ailab/bdpi. Appendix: https://arxiv.org/abs/1903.04193.
Pages: 19-34
Page count: 16
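The abstract describes BDPI's two components: several off-policy critics trained with experience replay, and an actor that slowly imitates their average greedy policy. Below is a minimal tabular sketch of that mechanism in Python/NumPy; the problem size, hyper-parameter values and helper names (act, store, train_step) are illustrative assumptions, and this is not the authors' implementation (see the repository linked in the abstract for that).

```python
# Minimal tabular sketch of the mechanism described in the abstract: several
# off-policy critics trained by Q-Learning from a shared experience replay
# buffer, and an actor that slowly imitates their average greedy policy.
# All names, sizes and hyper-parameter values are illustrative assumptions.
import numpy as np

n_states, n_actions, n_critics = 16, 4, 8     # assumed problem size
alpha, gamma, actor_lr = 0.2, 0.99, 0.05      # assumed learning rates / discount

rng = np.random.default_rng(0)
critics = [np.zeros((n_states, n_actions)) for _ in range(n_critics)]
actor = np.full((n_states, n_actions), 1.0 / n_actions)   # stochastic policy
replay = []                                                # shared replay buffer

def act(state):
    # Sample from the actor's probabilities: state-specific exploration.
    return rng.choice(n_actions, p=actor[state])

def store(s, a, r, s_next, done):
    replay.append((s, a, r, s_next, done))

def train_step(batch_size=32):
    if len(replay) < batch_size:
        return
    batch = [replay[i] for i in rng.integers(len(replay), size=batch_size)]

    # Off-policy critics: plain Q-Learning on replayed transitions, so no
    # importance-sampling or other off-policy corrections are needed.
    for Q in critics:
        for s, a, r, s_next, done in batch:
            target = r + (0.0 if done else gamma * Q[s_next].max())
            Q[s, a] += alpha * (target - Q[s, a])

    # Actor: move slowly towards the average greedy policy of the critics.
    greedy = np.zeros((n_states, n_actions))
    for Q in critics:
        greedy[np.arange(n_states), Q.argmax(axis=1)] += 1.0 / n_critics
    actor[:] = (1.0 - actor_lr) * actor + actor_lr * greedy
```

A training loop would interleave act(), store() and train_step() while interacting with an environment; the full algorithm in the paper uses function approximation to handle continuous states and pixel-based tasks, which this tabular sketch omits.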