Smoothed Action Value Functions for Learning Gaussian Policies

Cited by: 0

Authors:

Nachum, Ofir [1]

Norouzi, Mohammad [1]

Tucker, George [1]

Schuurmans, Dale [1, 2]

Affiliations:

[1] Google Brain, Mountain View, CA 94043 USA

[2] Univ Alberta, Dept Comp Sci, Edmonton, AB, Canada

Source:

INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 80 | 2018 / Vol. 80

DOI:

Not available

Chinese Library Classification number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

State-action value functions (i.e., Q-values) are ubiquitous in reinforcement learning (RL), giving rise to popular algorithms such as SARSA and Q-learning. We propose a new notion of action value defined by a Gaussian smoothed version of the expected Q-value. We show that such smoothed Q-values still satisfy a Bellman equation, making them learnable from experience sampled from an environment. Moreover, the gradients of expected reward with respect to the mean and covariance of a parameterized Gaussian policy can be recovered from the gradient and Hessian of the smoothed Q-value function. Based on these relationships, we develop new algorithms for training a Gaussian policy directly from a learned smoothed Q-value approximator. The approach is additionally amenable to proximal optimization by augmenting the objective with a penalty on KL-divergence from a previous policy. We find that the ability to learn both a mean and covariance during training leads to significantly improved results on standard continuous control benchmarks.
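
The link the abstract draws between the covariance gradient and the Hessian of the smoothed Q-value can be illustrated numerically. The sketch below is not the authors' algorithm. It assumes a hypothetical one-dimensional Q-function for a single fixed state, defines the smoothed value as the expectation of Q under a Gaussian centered at the action, and checks the standard Gaussian-convolution identity that the derivative with respect to the variance equals half the second derivative with respect to the action. The toy Q-function and all helper names (`q_value`, `smoothed_q`, `grad_wrt_variance`, `half_hessian_wrt_action`) are assumptions made for illustration.

```python
import math


def q_value(action):
    # Hypothetical 1-D Q-function for one fixed state (an assumption,
    # not taken from the paper).
    return math.sin(action) - 0.5 * action ** 2


def smoothed_q(action, variance, num_points=4001, width=8.0):
    """Approximate E[Q(a')] for a' ~ N(action, variance).

    Uses trapezoidal integration over +/- `width` standard deviations.
    """
    std = math.sqrt(variance)
    lo = action - width * std
    step = 2.0 * width * std / (num_points - 1)
    norm = std * math.sqrt(2.0 * math.pi)
    total = 0.0
    for i in range(num_points):
        x = lo + i * step
        density = math.exp(-0.5 * ((x - action) / std) ** 2) / norm
        weight = 0.5 if i in (0, num_points - 1) else 1.0
        total += weight * density * q_value(x) * step
    return total


def grad_wrt_variance(action, variance, eps=1e-3):
    # Central finite difference of the smoothed value in the variance.
    upper = smoothed_q(action, variance + eps)
    lower = smoothed_q(action, variance - eps)
    return (upper - lower) / (2.0 * eps)


def half_hessian_wrt_action(action, variance, eps=1e-2):
    # Half of the central second difference of the smoothed value in the action.
    center = smoothed_q(action, variance)
    upper = smoothed_q(action + eps, variance)
    lower = smoothed_q(action - eps, variance)
    return 0.5 * (upper - 2.0 * center + lower) / eps ** 2


if __name__ == "__main__":
    a, v = 0.3, 0.5
    print("d(smoothed Q)/d(variance):   ", grad_wrt_variance(a, v))
    print("0.5 * d2(smoothed Q)/d(a)^2: ", half_hessian_wrt_action(a, v))
```

The two printed numbers should agree up to finite-difference error. In this one-dimensional setting, it shows how a variance update could in principle be read off the curvature of a learned smoothed Q-value approximator; the multivariate case would use the full Hessian in place of the scalar second derivative.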

Pages: 9
