Expected Policy Gradients for Reinforcement Learning

Cited: 0
Authors
Ciosek, Kamil [1 ]
Whiteson, Shimon [2 ]
Affiliations
[1] Microsoft Res Cambridge, 21 Stn Rd, Cambridge CB1 2FB, England
[2] Univ Oxford, Dept Comp Sci, Wolfson Bldg,Parks Rd, Oxford OX1 3QD, England
Funding
European Research Council
Keywords
policy gradients; exploration; bounded actions; reinforcement learning; Markov decision process (MDP)
DOI
Not available
CLC number (Chinese Library Classification)
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
We propose expected policy gradients (EPG), which unify stochastic policy gradients (SPG) and deterministic policy gradients (DPG) for reinforcement learning. Inspired by expected sarsa, EPG integrates (or sums) across actions when estimating the gradient, instead of relying only on the action in the sampled trajectory. For continuous action spaces, we first derive a practical result for Gaussian policies and quadratic critics and then extend it to a universal analytical method, covering a broad class of actors and critics, including Gaussian, exponential families, and policies with bounded support. For Gaussian policies, we introduce an exploration method that uses covariance proportional to e^H, where H is the scaled Hessian of the critic with respect to the actions. For discrete action spaces, we derive a variant of EPG based on softmax policies. We also establish a new general policy gradient theorem, of which the stochastic and deterministic policy gradient theorems are special cases. Furthermore, we prove that EPG reduces the variance of the gradient estimates without requiring deterministic policies and with little computational overhead. Finally, we provide an extensive experimental evaluation of EPG and show that it outperforms existing approaches on multiple challenging control domains.
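For intuition, a minimal worked sketch of the contrast the abstract describes (the notation here is ours, not quoted from the paper): a standard stochastic policy gradient estimator relies on the single sampled action a_t,

\[
\nabla_\theta J \approx \nabla_\theta \log \pi(a_t \mid s_t; \theta)\, \hat{Q}(s_t, a_t),
\]

whereas EPG integrates (or, for discrete actions, sums) the critic over the entire action space at each visited state,

\[
\nabla_\theta J = \mathbb{E}_{s \sim \rho^{\pi}}\!\left[ \int_{\mathcal{A}} \nabla_\theta \pi(a \mid s; \theta)\, \hat{Q}(s, a)\, \mathrm{d}a \right],
\]

which removes the variance contributed by sampling the action itself. Read this way, the Gaussian exploration rule sets the policy covariance proportional to e^H, with H the scaled Hessian of \hat{Q}(s, \cdot) with respect to the action: for a concave critic, exploration is broader along directions where the critic is flat and narrower where it is sharply curved.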
Pages: 51