Expected Policy Gradients for Reinforcement Learning

Cited: 0
Authors
Ciosek, Kamil [1 ]
Whiteson, Shimon [2 ]
Affiliations
[1] Microsoft Res Cambridge, 21 Stn Rd, Cambridge CB1 2FB, England
[2] Univ Oxford, Dept Comp Sci, Wolfson Bldg,Parks Rd, Oxford OX1 3QD, England
Funding
European Research Council;
Keywords
policy gradients; exploration; bounded actions; reinforcement learning; Markov decision process (MDP);
DOI
Not available
Chinese Library Classification
TP [automation technology, computer technology];
Discipline Code
0812;
Abstract
We propose expected policy gradients (EPG), which unify stochastic policy gradients (SPG) and deterministic policy gradients (DPG) for reinforcement learning. Inspired by expected sarsa, EPG integrates (or sums) across actions when estimating the gradient, instead of relying only on the action in the sampled trajectory. For continuous action spaces, we first derive a practical result for Gaussian policies and quadratic critics and then extend it to a universal analytical method, covering a broad class of actors and critics, including Gaussian, exponential families, and policies with bounded support. For Gaussian policies, we introduce an exploration method that uses covariance proportional to e^H, where H is the scaled Hessian of the critic with respect to the actions. For discrete action spaces, we derive a variant of EPG based on softmax policies. We also establish a new general policy gradient theorem, of which the stochastic and deterministic policy gradient theorems are special cases. Furthermore, we prove that EPG reduces the variance of the gradient estimates without requiring deterministic policies and with little computational overhead. Finally, we provide an extensive experimental evaluation of EPG and show that it outperforms existing approaches on multiple challenging control domains.
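For reference, the estimator the abstract describes can be written via the standard policy gradient theorem; the notation below is a sketch in our own symbols, not necessarily the paper's:

\nabla_\theta J(\theta) = \int_{\mathcal{S}} \rho^{\pi}(s) \int_{\mathcal{A}} \nabla_\theta \pi_\theta(a \mid s)\, \hat{Q}(s,a)\, da\, ds

SPG estimates the inner integral from the single sampled action, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \hat{Q}(s_t, a_t). EPG instead evaluates it analytically or by quadrature (for discrete actions, the sum \sum_a \nabla_\theta \pi_\theta(a \mid s)\, \hat{Q}(s,a)), which removes the variance contributed by action sampling.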
Pages: 51
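To make the Gaussian-policy special case concrete, here is a minimal numpy sketch, assuming a quadratic critic Q(a) = 0.5 a^T A a + b^T a (all names are illustrative, not from the paper). For this actor/critic pair the expected gradient with respect to the policy mean has a closed form, the critic's action-gradient at the mean, and the exploration covariance can be set proportional to e^{cH} with H = A the critic's Hessian in the action, as the abstract describes.

import numpy as np

def epg_gaussian_quadratic_step(mu, A, b, lr=0.1, c=1.0):
    # Quadratic critic Q(a) = 0.5 a^T A a + b^T a. Integrating the
    # policy gradient over a ~ N(mu, Sigma) collapses to the critic's
    # action-gradient at the mean: grad_mu = A @ mu + b (no sampling).
    mu = mu + lr * (A @ mu + b)
    # Exploration covariance Sigma proportional to e^{cH}, H = A: the
    # spread shrinks along sharply concave directions of the critic and
    # grows where it is flat, concentrating exploration where it helps.
    w, V = np.linalg.eigh(c * A)      # A is symmetric, so eigh applies
    sigma = (V * np.exp(w)) @ V.T     # matrix exponential via eigenvalues
    return mu, sigma

# Hypothetical bandit-style example: concave critic peaked at a* = [1, -1].
A = np.array([[-2.0, 0.0], [0.0, -0.5]])
b = -A @ np.array([1.0, -1.0])
mu = np.zeros(2)
for _ in range(200):
    mu, sigma = epg_gaussian_quadratic_step(mu, A, b)
print(np.round(mu, 3))     # -> [1., -1.]
print(np.round(sigma, 3))  # diag(e^-2, e^-0.5): tighter where curvature is sharper

Because the inner integral is exact here, no action needs to be sampled to form the mean update, which illustrates the variance reduction over SPG that the abstract claims.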