Smoothed Action Value Functions for Learning Gaussian Policies

Cited: 0
Authors
Nachum, Ofir [1]
Norouzi, Mohammad [1]
Tucker, George [1]
Schuurmans, Dale [1,2]
Affiliations
[1] Google Brain, Mountain View, CA 94043 USA
[2] Univ Alberta, Dept Comp Sci, Edmonton, AB, Canada
Keywords
DOI
Not available
CLC Number (Chinese Library Classification)
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
State-action value functions (i.e., Q-values) are ubiquitous in reinforcement learning (RL), giving rise to popular algorithms such as SARSA and Q-learning. We propose a new notion of action value defined by a Gaussian smoothed version of the expected Q-value. We show that such smoothed Q-values still satisfy a Bellman equation, making them learnable from experience sampled from an environment. Moreover, the gradients of expected reward with respect to the mean and covariance of a parameterized Gaussian policy can be recovered from the gradient and Hessian of the smoothed Q-value function. Based on these relationships, we develop new algorithms for training a Gaussian policy directly from a learned smoothed Q-value approximator. The approach is additionally amenable to proximal optimization by augmenting the objective with a penalty on KL-divergence from a previous policy. We find that the ability to learn both a mean and covariance during training leads to significantly improved results on standard continuous control benchmarks.
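In symbols (a reconstruction from the abstract's description; the notation mu(s), Sigma(s) for the Gaussian policy's mean and covariance and J for expected reward is assumed here, not taken from this record), the smoothed action value and the gradient relationships read approximately:

\tilde{Q}^{\pi}(s, a) = \mathbb{E}_{\tilde{a} \sim \mathcal{N}(a,\, \Sigma(s))}\!\left[ Q^{\pi}(s, \tilde{a}) \right], \qquad
\nabla_{\mu} J \approx \mathbb{E}_{s}\!\left[ \nabla_{a} \tilde{Q}^{\pi}(s, a)\big|_{a = \mu(s)} \right], \qquad
\nabla_{\Sigma} J \approx \mathbb{E}_{s}\!\left[ \tfrac{1}{2} \nabla_{a}^{2} \tilde{Q}^{\pi}(s, a)\big|_{a = \mu(s)} \right].

That is, the smoothed Q-value is the expected Q-value under Gaussian perturbations of the action, and the policy's mean and covariance can be updated from the first derivative and the Hessian of the learned smoothed Q-value approximator at the mean.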
Pages: 9
Related Papers
50 records in total
  • [31] Learning and smoothed analysis
    Microsoft Research, New England, United States
    Proc. Annu. IEEE Symp. Found. Comput. Sci. FOCS, 2009, : 395 - 404
  • [32] Learning and smoothed analysis
    Kalai, Adam Tauman
    Samorodnitsky, Alex
    Teng, Shang-Hua
    2009 50TH ANNUAL IEEE SYMPOSIUM ON FOUNDATIONS OF COMPUTER SCIENCE: FOCS 2009, PROCEEDINGS, 2009, : 395 - 404
  • [33] Teacher-directed learning with Gaussian and sigmoid activation functions
    Kamimura, R
    NEURAL INFORMATION PROCESSING, 2004, 3316 : 530 - 536
  • [34] Frequency Effects in Action Versus Value Learning
    Don, Hilary J.
    Worthy, Darrell A.
    JOURNAL OF EXPERIMENTAL PSYCHOLOGY-LEARNING MEMORY AND COGNITION, 2022, 48 (09) : 1311 - 1327
  • [35] Sparse Approximations to Value Functions in Reinforcement Learning
    Jakab, Hunor S.
    Csato, Lehel
    ARTIFICIAL NEURAL NETWORKS, 2015, : 295 - 314
  • [36] Multiagent Reinforcement Learning With Unshared Value Functions
    Hu, Yujing
    Gao, Yang
    An, Bo
    IEEE TRANSACTIONS ON CYBERNETICS, 2015, 45 (04) : 647 - 662
  • [37] Brazilian Socio-Environmental Policies and the Learning of a New Action
    Silva Oliveira, Anderson Eduardo
    DESENVOLVIMENTO E MEIO AMBIENTE, 2011, 23 : 133 - 148
  • [38] Discrete Action On-Policy Learning with Action-Value Critic
    Yue, Yuguang
    Tang, Yunhao
    Yin, Mingzhang
    Zhou, Mingyuan
    INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND STATISTICS, VOL 108, 2020, 108 : 1977 - 1986
  • [39] The value of interest rate stabilization policies when agents are learning
    Duffy, John
    Xiao, Wei
    JOURNAL OF MONEY CREDIT AND BANKING, 2007, 39 (08) : 2041 - 2056
  • [40] Approximating Paley-Wiener functions by smoothed step functions
    Beaty, M. G.
    Dodson, M. M.
    Higgins, J. R.
    JOURNAL OF APPROXIMATION THEORY, 1994, 78 (03) : 433 - 445