Smoothed Action Value Functions for Learning Gaussian Policies

Cited by: 0
Authors
Nachum, Ofir [1 ]
Norouzi, Mohammad [1 ]
Tucker, George [1 ]
Schuurmans, Dale [1 ,2 ]
Affiliations
[1] Google Brain, Mountain View, CA 94043 USA
[2] Univ Alberta, Dept Comp Sci, Edmonton, AB, Canada
Keywords
DOI
Not available
Chinese Library Classification (CLC)
TP18 [Theory of artificial intelligence];
Discipline codes
081104; 0812; 0835; 1405;
Abstract
State-action value functions (i.e., Q-values) are ubiquitous in reinforcement learning (RL), giving rise to popular algorithms such as SARSA and Q-learning. We propose a new notion of action value defined by a Gaussian smoothed version of the expected Q-value. We show that such smoothed Q-values still satisfy a Bellman equation, making them learnable from experience sampled from an environment. Moreover, the gradients of expected reward with respect to the mean and covariance of a parameterized Gaussian policy can be recovered from the gradient and Hessian of the smoothed Q-value function. Based on these relationships, we develop new algorithms for training a Gaussian policy directly from a learned smoothed Q-value approximator. The approach is additionally amenable to proximal optimization by augmenting the objective with a penalty on KL-divergence from a previous policy. We find that the ability to learn both a mean and covariance during training leads to significantly improved results on standard continuous control benchmarks.
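The gradient and Hessian relationship stated in the abstract can be checked on a toy example. The sketch below (a minimal illustration, not the paper's algorithm) uses a made-up quadratic critic Q(a) = -aᵀAa + bᵀa, for which the Gaussian-smoothed value Q̃(μ, Σ) = E_{a~N(μ,Σ)}[Q(a)] has the closed form -μᵀAμ + bᵀμ - tr(AΣ); the names A, b, mu, and Sigma are illustrative assumptions. It verifies that ∂Q̃/∂Σ equals half the Hessian of Q̃ with respect to μ, which is the identity that lets the covariance of a Gaussian policy be updated from the learned smoothed Q-value:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy critic: a concave quadratic over 2-D actions
# (a stand-in for a learned Q-function; A and b are illustrative).
A = np.array([[1.0, 0.2],
              [0.2, 0.5]])
b = np.array([0.3, -0.1])

def Q(actions):
    """Q(a) = -a^T A a + b^T a, vectorized over rows of `actions`."""
    return -np.einsum("ni,ij,nj->n", actions, A, actions) + actions @ b

def smoothed_Q_mc(mu, Sigma, n=400_000):
    """Monte Carlo estimate of Q~(mu, Sigma) = E_{a ~ N(mu, Sigma)}[Q(a)]."""
    return Q(rng.multivariate_normal(mu, Sigma, size=n)).mean()

mu = np.array([0.5, -0.2])
Sigma = np.array([[0.10, 0.02],
                  [0.02, 0.20]])

# Closed form for this quadratic: Q~ = -mu^T A mu + b^T mu - tr(A Sigma)
closed = -mu @ A @ mu + b @ mu - np.trace(A @ Sigma)
assert abs(smoothed_Q_mc(mu, Sigma) - closed) < 2e-2

# Identities from the abstract, checked analytically on the closed form:
#   d Q~ / d mu    = -2 A mu + b         (gradient drives the mean update)
#   d Q~ / d Sigma = -A = (1/2) Hessian  (half the Hessian drives the covariance)
grad_mu = -2 * A @ mu + b
hessian_mu = -2 * A
dQ_dSigma = -A
assert np.allclose(dQ_dSigma, 0.5 * hessian_mu)
```

In the paper's setting Q̃ is not known in closed form but is learned from sampled experience via its Bellman equation; the same two derivative identities then supply the policy-gradient updates for the Gaussian policy's mean and covariance.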
Pages: 9