Expected Policy Gradients for Reinforcement Learning

Cited by: 0
Authors
Ciosek, Kamil [1]
Whiteson, Shimon [2]
Affiliations
[1] Microsoft Research Cambridge, 21 Station Rd, Cambridge CB1 2FB, England
[2] University of Oxford, Department of Computer Science, Wolfson Building, Parks Rd, Oxford OX1 3QD, England
Funding
European Research Council
Keywords
policy gradients; exploration; bounded actions; reinforcement learning; Markov decision process (MDP);
DOI
Not available
CLC Number
TP [Automation Technology, Computer Technology]
Subject Classification Code
0812
Abstract
We propose expected policy gradients (EPG), which unify stochastic policy gradients (SPG) and deterministic policy gradients (DPG) for reinforcement learning. Inspired by Expected Sarsa, EPG integrates (or sums) across actions when estimating the gradient, instead of relying only on the action in the sampled trajectory. For continuous action spaces, we first derive a practical result for Gaussian policies and quadratic critics and then extend it to a universal analytical method, covering a broad class of actors and critics, including Gaussian, exponential families, and policies with bounded support. For Gaussian policies, we introduce an exploration method that uses covariance proportional to e^H, where H is the scaled Hessian of the critic with respect to the actions. For discrete action spaces, we derive a variant of EPG based on softmax policies. We also establish a new general policy gradient theorem, of which the stochastic and deterministic policy gradient theorems are special cases. Furthermore, we prove that EPG reduces the variance of the gradient estimates without requiring deterministic policies and with little computational overhead. Finally, we provide an extensive experimental evaluation of EPG and show that it outperforms existing approaches on multiple challenging control domains.
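To make the Gaussian case in the abstract concrete, the sketch below (not the authors' implementation; the function name, the toy quadratic critic, and the constants sigma0, c, and lr are illustrative assumptions) shows the two ingredients described above: a mean update that follows the critic's action-gradient at the mean, and an exploration covariance proportional to e^(c*H), computed with a matrix exponential.

```python
import numpy as np
from scipy.linalg import expm  # matrix exponential for the e^H covariance

def epg_gaussian_step(mu, q_grad, q_hess, sigma0=0.2, c=1.0, lr=1e-2, rng=None):
    """One illustrative update for a Gaussian policy N(mu, Sigma).

    q_grad -- gradient of the critic Q(s, a) w.r.t. a, evaluated at a = mu
    q_hess -- Hessian of the critic w.r.t. a, evaluated at a = mu
    (In practice both would come from a learned critic; here they are inputs.)
    """
    rng = np.random.default_rng() if rng is None else rng
    # For a critic that is (locally) quadratic in the action, integrating the
    # policy-gradient integrand over actions analytically makes the mean
    # update follow the critic's action-gradient at the mean:
    mu_next = mu + lr * q_grad
    # Exploration covariance proportional to e^(c*H), where H is the (scaled)
    # Hessian of the critic with respect to the actions:
    sigma = sigma0**2 * expm(c * q_hess)
    action = rng.multivariate_normal(mu_next, sigma)
    return mu_next, sigma, action

# Hypothetical usage with a toy quadratic critic Q(a) = -0.5 a^T A a + b^T a
A = np.array([[2.0, 0.0], [0.0, 0.5]])
b = np.array([1.0, -1.0])
mu = np.zeros(2)
mu, sigma, a = epg_gaussian_step(mu, q_grad=b - A @ mu, q_hess=-A)
```

Because the matrix exponential of a symmetric matrix is always positive definite, sigma is a valid covariance regardless of the critic's curvature; its variance shrinks along directions where Q is sharply concave and grows where Q is flat, which is the exploration behavior the abstract describes.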
Pages: 51