Smoothed Action Value Functions for Learning Gaussian Policies

Cited by: 0
Authors
Nachum, Ofir [1]
Norouzi, Mohammad [1]
Tucker, George [1]
Schuurmans, Dale [1,2]
Affiliations
[1] Google Brain, Mountain View, CA 94043 USA
[2] Univ Alberta, Dept Comp Sci, Edmonton, AB, Canada
Keywords: (none listed)
DOI: (not available)
CLC classification: TP18 [Artificial Intelligence Theory]
Discipline codes: 081104; 0812; 0835; 1405
Abstract
State-action value functions (i.e., Q-values) are ubiquitous in reinforcement learning (RL), giving rise to popular algorithms such as SARSA and Q-learning. We propose a new notion of action value defined by a Gaussian smoothed version of the expected Q-value. We show that such smoothed Q-values still satisfy a Bellman equation, making them learnable from experience sampled from an environment. Moreover, the gradients of expected reward with respect to the mean and covariance of a parameterized Gaussian policy can be recovered from the gradient and Hessian of the smoothed Q-value function. Based on these relationships, we develop new algorithms for training a Gaussian policy directly from a learned smoothed Q-value approximator. The approach is additionally amenable to proximal optimization by augmenting the objective with a penalty on KL-divergence from a previous policy. We find that the ability to learn both a mean and covariance during training leads to significantly improved results on standard continuous control benchmarks.
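For readers scanning the record, the relationships the abstract describes can be written out compactly. The following is a sketch in standard notation, reconstructed from the abstract alone; the symbols $\tilde{Q}$, $\mu$, $\Sigma$, $J$ and the proportionality constants are assumptions, not quotations of the paper:

```latex
% Smoothed Q-value: the expected Q-value under Gaussian perturbation of the action
\tilde{Q}^{\pi}(s, a) = \mathbb{E}_{\tilde{a} \sim \mathcal{N}(a, \Sigma)}\left[ Q^{\pi}(s, \tilde{a}) \right]

% For a Gaussian policy \pi(a \mid s) = \mathcal{N}(a \mid \mu(s), \Sigma(s)), the
% abstract states that gradients of expected reward J with respect to \mu and
% \Sigma are recoverable from the gradient and Hessian of \tilde{Q}:
\nabla_{\mu} J \propto \mathbb{E}_{s}\!\left[ \nabla_{a} \tilde{Q}^{\pi}(s, a) \,\big|_{a=\mu(s)} \right],
\qquad
\nabla_{\Sigma} J \propto \mathbb{E}_{s}\!\left[ \nabla_{a}^{2} \tilde{Q}^{\pi}(s, a) \,\big|_{a=\mu(s)} \right]
```

These identities suggest a direct route to training a Gaussian policy from a learned smoothed Q-value approximator: differentiate the approximator with respect to the action. The JAX sketch below is an illustration under those assumptions, not the paper's algorithm; `q_fn`, `mu`, `sigma`, and `lr` are hypothetical stand-ins.

```python
# Minimal sketch (not the paper's implementation) of recovering Gaussian policy
# updates from the gradient and Hessian of a smoothed Q-value approximator.
import jax
import jax.numpy as jnp

def q_fn(state, action):
    # Hypothetical stand-in for a learned smoothed Q-value network Q~(s, a);
    # any differentiable scalar-valued function of (state, action) works here.
    return -jnp.sum((action - jnp.tanh(state)) ** 2)

def policy_update(state, mu, sigma, lr=1e-2):
    """One ascent step for a Gaussian policy N(mu, diag(sigma)), using
    grad_a Q~(s, a)|_{a=mu} for the mean and the Hessian diagonal for the
    (diagonal) covariance, per the relationship stated in the abstract."""
    grad_a = jax.grad(q_fn, argnums=1)(state, mu)     # gradient w.r.t. action
    hess_a = jax.hessian(q_fn, argnums=1)(state, mu)  # Hessian w.r.t. action
    new_mu = mu + lr * grad_a
    # Constant factors and positivity constraints are handled more carefully
    # in the paper; here we simply clip the diagonal covariance from below.
    new_sigma = jnp.maximum(sigma + lr * jnp.diag(hess_a), 1e-4)
    return new_mu, new_sigma

state = jnp.array([0.1, -0.3])
mu, sigma = jnp.zeros(2), jnp.ones(2)
mu, sigma = policy_update(state, mu, sigma)
```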
Pages: 9
Related papers (50 in total; 10 shown)
  • [1] Learning action-value functions using neural networks with incremental learning ability
    Shiraga, N
    Ozawa, S
    Abe, S
    KNOWLEDGE-BASED INTELLIGENT INFORMATION ENGINEERING SYSTEMS & ALLIED TECHNOLOGIES, PTS 1 AND 2, 2001, 69: 22-26
  • [2] Context Transfer in Reinforcement Learning Using Action-Value Functions
    Mousavi, Amin
    Araabi, Babak Nadjar
    Ahmadabadi, Majid Nili
    COMPUTATIONAL INTELLIGENCE AND NEUROSCIENCE, 2014, 2014
  • [3] Learning Choice Functions with Gaussian Processes
    Benavoli, Alessio
    Azzimonti, Dario
    Piga, Dario
    UNCERTAINTY IN ARTIFICIAL INTELLIGENCE, 2023, 216: 141-151
  • [4] Functions and forms of action learning
    Boak, George
    ACTION LEARNING, 2025, 22 (01): 5-6
  • [5] Computing factored value functions for policies in structured MDPs
    Koller, D
    Parr, R
    IJCAI-99: PROCEEDINGS OF THE SIXTEENTH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOLS 1 & 2, 1999: 1332-1339
  • [6] On the structure of value functions for threshold policies in queueing models
    Bhulai, S
    Koole, G
    JOURNAL OF APPLIED PROBABILITY, 2003, 40 (03): 613-622
  • [7] A hybrid learning strategy for discovery of policies of action
    Ribeiro, Richardson
    Enembreck, Fabricio
    Koerich, Alessandro L.
    ADVANCES IN ARTIFICIAL INTELLIGENCE - IBERAMIA-SBIA 2006, PROCEEDINGS, 2006, 4140: 268-277
  • [8] Learning Continuous-Action Control Policies
    Pazis, Jason
    Lagoudakis, Michail G.
    ADPRL: 2009 IEEE SYMPOSIUM ON ADAPTIVE DYNAMIC PROGRAMMING AND REINFORCEMENT LEARNING, 2009: 169-176
  • [9] Composite Gaussian processes flows for learning discontinuous multimodal policies
    Wang, Shu-yuan
    Sasaki, Hikaru
    Matsubara, Takamitsu
    APPLIED INTELLIGENCE, 2025, 55 (06)
  • [10] Learning optimal policies in potential mean field games: Smoothed policy iteration algorithms
    Tang, Qing
    Song, Jiahao
    SIAM JOURNAL ON CONTROL AND OPTIMIZATION, 2024, 62 (01): 351-375