Smoothed Action Value Functions for Learning Gaussian Policies

Cited by: 0
Authors
Nachum, Ofir [1]
Norouzi, Mohammad [1]
Tucker, George [1]
Schuurmans, Dale [1,2]
Affiliations
[1] Google Brain, Mountain View, CA 94043 USA
[2] Univ Alberta, Dept Comp Sci, Edmonton, AB, Canada
Keywords: (none listed)
DOI: (not available)
CLC classification: TP18 [Artificial Intelligence Theory]
Discipline codes: 081104; 0812; 0835; 1405
Abstract
State-action value functions (i.e., Q-values) are ubiquitous in reinforcement learning (RL), giving rise to popular algorithms such as SARSA and Q-learning. We propose a new notion of action value defined by a Gaussian smoothed version of the expected Q-value. We show that such smoothed Q-values still satisfy a Bellman equation, making them learnable from experience sampled from an environment. Moreover, the gradients of expected reward with respect to the mean and covariance of a parameterized Gaussian policy can be recovered from the gradient and Hessian of the smoothed Q-value function. Based on these relationships, we develop new algorithms for training a Gaussian policy directly from a learned smoothed Q-value approximator. The approach is additionally amenable to proximal optimization by augmenting the objective with a penalty on KL-divergence from a previous policy. We find that the ability to learn both a mean and covariance during training leads to significantly improved results on standard continuous control benchmarks.
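For readers scanning the record, the relationships the abstract describes can be written out compactly. The following is a sketch in standard notation, reconstructed from the abstract alone; the symbols $\tilde{Q}$, $\mu$, $\Sigma$, $J$ and the proportionality constants are assumptions, not quotations of the paper:

```latex
% Smoothed Q-value: the expected Q-value under Gaussian perturbation of the action
\tilde{Q}^{\pi}(s, a) = \mathbb{E}_{\tilde{a} \sim \mathcal{N}(a, \Sigma)}\left[ Q^{\pi}(s, \tilde{a}) \right]

% For a Gaussian policy \pi(a \mid s) = \mathcal{N}(a \mid \mu(s), \Sigma(s)), the
% abstract states that gradients of expected reward J with respect to \mu and
% \Sigma are recoverable from the gradient and Hessian of \tilde{Q}:
\nabla_{\mu} J \propto \mathbb{E}_{s}\!\left[ \nabla_{a} \tilde{Q}^{\pi}(s, a) \,\big|_{a=\mu(s)} \right],
\qquad
\nabla_{\Sigma} J \propto \mathbb{E}_{s}\!\left[ \nabla_{a}^{2} \tilde{Q}^{\pi}(s, a) \,\big|_{a=\mu(s)} \right]
```

These identities suggest a direct route to training a Gaussian policy from a learned smoothed Q-value approximator: differentiate the approximator with respect to the action. The JAX sketch below is an illustration under those assumptions, not the paper's algorithm; `q_fn`, `mu`, `sigma`, and `lr` are hypothetical stand-ins.

```python
# Minimal sketch (not the paper's implementation) of recovering Gaussian policy
# updates from the gradient and Hessian of a smoothed Q-value approximator.
import jax
import jax.numpy as jnp

def q_fn(state, action):
    # Hypothetical stand-in for a learned smoothed Q-value network Q~(s, a);
    # any differentiable scalar-valued function of (state, action) works here.
    return -jnp.sum((action - jnp.tanh(state)) ** 2)

def policy_update(state, mu, sigma, lr=1e-2):
    """One ascent step for a Gaussian policy N(mu, diag(sigma)), using
    grad_a Q~(s, a)|_{a=mu} for the mean and the Hessian diagonal for the
    (diagonal) covariance, per the relationship stated in the abstract."""
    grad_a = jax.grad(q_fn, argnums=1)(state, mu)     # gradient w.r.t. action
    hess_a = jax.hessian(q_fn, argnums=1)(state, mu)  # Hessian w.r.t. action
    new_mu = mu + lr * grad_a
    # Constant factors and positivity constraints are handled more carefully
    # in the paper; here we simply clip the diagonal covariance from below.
    new_sigma = jnp.maximum(sigma + lr * jnp.diag(hess_a), 1e-4)
    return new_mu, new_sigma

state = jnp.array([0.1, -0.3])
mu, sigma = jnp.zeros(2), jnp.ones(2)
mu, sigma = policy_update(state, mu, sigma)
```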
Pages: 9
Related papers (50 in total; 10 shown)
  • [1] Learning action-value functions using neural networks with incremental learning ability
    Shiraga, N
    Ozawa, S
    Abe, S
    KNOWLEDGE-BASED INTELLIGENT INFORMATION ENGINEERING SYSTEMS & ALLIED TECHNOLOGIES, PTS 1 AND 2, 2001, 69: 22-26
  • [2] Context Transfer in Reinforcement Learning Using Action-Value Functions
    Mousavi, Amin
    Araabi, Babak Nadjar
    Ahmadabadi, Majid Nili
    COMPUTATIONAL INTELLIGENCE AND NEUROSCIENCE, 2014, 2014
  • [3] Learning Choice Functions with Gaussian Processes
    Benavoli, Alessio
    Azzimonti, Dario
    Piga, Dario
    UNCERTAINTY IN ARTIFICIAL INTELLIGENCE, 2023, 216: 141-151
  • [4] Functions and forms of action learning
    Boak, George
    ACTION LEARNING, 2025, 22 (01): 5-6
  • [5] Computing factored value functions for policies in structured MDPs
    Koller, D
    Parr, R
    IJCAI-99: PROCEEDINGS OF THE SIXTEENTH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOLS 1 & 2, 1999: 1332-1339
  • [6] On the structure of value functions for threshold policies in queueing models
    Bhulai, S
    Koole, G
    JOURNAL OF APPLIED PROBABILITY, 2003, 40 (03): 613-622
  • [7] A hybrid learning strategy for discovery of policies of action
    Ribeiro, Richardson
    Enembreck, Fabricio
    Koerich, Alessandro L.
    ADVANCES IN ARTIFICIAL INTELLIGENCE - IBERAMIA-SBIA 2006, PROCEEDINGS, 2006, 4140: 268-277
  • [8] Learning Continuous-Action Control Policies
    Pazis, Jason
    Lagoudakis, Michail G.
    ADPRL: 2009 IEEE SYMPOSIUM ON ADAPTIVE DYNAMIC PROGRAMMING AND REINFORCEMENT LEARNING, 2009: 169-176
  • [9] Composite Gaussian processes flows for learning discontinuous multimodal policies
    Wang, Shu-yuan
    Sasaki, Hikaru
    Matsubara, Takamitsu
    APPLIED INTELLIGENCE, 2025, 55 (06)
  • [10] Learning optimal policies in potential mean field games: Smoothed policy iteration algorithms
    Tang, Qing
    Song, Jiahao
    SIAM JOURNAL ON CONTROL AND OPTIMIZATION, 2024, 62 (01): 351-375