An Off-policy Policy Gradient Theorem Using Emphatic Weightings

Cited by: 0

Authors:

Imani, Ehsan [1]

Graves, Eric [1]

White, Martha [1]

Affiliations:

[1] Univ Alberta, Reinforcement Learning & Artificial Intelligence, Dept Comp Sci, Edmonton, AB, Canada

Source:

ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 31 (NIPS 2018) | 2018 / Vol. 31

Keywords:

DOI:

Not available

Chinese Library Classification number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Policy gradient methods are widely used for control in reinforcement learning, particularly for the continuous action setting. There have been a host of theoretically sound algorithms proposed for the on-policy setting, due to the existence of the policy gradient theorem, which provides a simplified form for the gradient. In off-policy learning, however, where the behaviour policy is not necessarily attempting to learn and follow the optimal policy for the given task, the existence of such a theorem has been elusive. In this work, we solve this open problem by providing the first off-policy policy gradient theorem. The key to the derivation is the use of emphatic weightings. We develop a new actor-critic algorithm, called Actor Critic with Emphatic weightings (ACE), that approximates the simplified gradients provided by the theorem. We demonstrate in a simple counterexample that previous off-policy policy gradient methods, particularly OffPAC and DPG, converge to the wrong solution whereas ACE finds the optimal solution.
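
The abstract names emphatic weightings as the key ingredient but does not spell them out. As a rough illustration only, the sketch below computes a per-step emphatic weighting along one trajectory using the followon-trace recursion familiar from the emphatic TD literature. The function name `emphatic_weights`, its arguments, the constant `interest` value, and the mixing parameter `eta` are all assumptions made for this example and are not taken from this record; it is not the authors' implementation.

```python
def emphatic_weights(rhos, gamma=0.9, eta=1.0, interest=1.0):
    """Return one emphatic weighting M_t per step of a single trajectory.

    rhos[t] is assumed to be the importance sampling ratio
    pi(A_t | S_t) / mu(A_t | S_t) for the action taken at step t.
    eta trades off between no emphasis (eta = 0) and the full
    followon trace (eta = 1); interest is a constant per-state interest.
    """
    followon = 0.0   # F_{-1}: no history before the first step
    prev_rho = 1.0   # placeholder; multiplied by followon = 0 at t = 0
    weights = []
    for rho in rhos:
        # F_t = gamma * rho_{t-1} * F_{t-1} + i(S_t)
        followon = gamma * prev_rho * followon + interest
        # M_t = (1 - eta) * i(S_t) + eta * F_t
        weights.append((1.0 - eta) * interest + eta * followon)
        prev_rho = rho
    return weights


if __name__ == "__main__":
    # Hypothetical ratios for a three-step trajectory.
    print(emphatic_weights([2.0, 0.5, 1.0], gamma=0.5))
```

In an actor update of this style, each weight would scale the usual importance-weighted score-function term for that step; with `eta = 0` every weight reduces to the interest alone, so no emphasis is applied.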


Pages: 11

50 items in total

[1] Gradient temporal-difference learning for off-policy evaluation using emphatic weightings
Cao, Jiaqing
Liu, Quan
Zhu, Fei
Fu, Qiming
Zhong, Shan
[J]. INFORMATION SCIENCES, 2021, 580 : 311 - 330
[2] Off-Policy Actor-Critic with Emphatic Weightings
Graves, Eric
Imani, Ehsan
Kumaraswamy, Raksha
White, Martha
[J]. JOURNAL OF MACHINE LEARNING RESEARCH, 2023, 24
[3] A Nonparametric Off-Policy Policy Gradient
Tosatto, Samuele
Carvalho, Joao
Abdulsamad, Hany
Peters, Jan
[J]. INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND STATISTICS, VOL 108, 2020, 108
[4] Off-Policy Policy Gradient with State Distribution Correction
Liu, Yao
Swaminathan, Adith
Agarwal, Alekh
Brunskill, Emma
[J]. 35TH UNCERTAINTY IN ARTIFICIAL INTELLIGENCE CONFERENCE (UAI 2019), 2020, 115 : 1180 - 1190
[5] An Emphatic Approach to the Problem of Off-policy Temporal-Difference Learning
Sutton, Richard S.
Mahmood, A. Rupam
White, Martha
[J]. JOURNAL OF MACHINE LEARNING RESEARCH, 2016, 17
[6] Batch Reinforcement Learning With a Nonparametric Off-Policy Policy Gradient
Tosatto, Samuele
Carvalho, Joao
Peters, Jan
[J]. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2022, 44 (10) : 5996 - 6010
[7] Off-Policy Evaluation via Off-Policy Classification
Irpan, Alex
Rao, Kanishka
Bousmalis, Konstantinos
Harris, Chris
Ibarz, Julian
Levine, Sergey
[J]. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 32 (NIPS 2019), 2019, 32
[8] Off-Policy Natural Policy Gradient Method for a Biped Walking Using a CPG Controller
Nakamura, Yutaka
Mori, Takeshi
Tokita, Yoichi
Shibata, Tomohiro
Ishii, Shin
[J]. JOURNAL OF ROBOTICS AND MECHATRONICS, 2005, 17 (06) : 636 - 644
[9] Interpolated Policy Gradient: Merging On-Policy and Off-Policy Gradient Estimation for Deep Reinforcement Learning
Gu, Shixiang
Lillicrap, Timothy
Ghahramani, Zoubin
Turner, Richard E.
Scholkopf, Bernhard
Levine, Sergey
[J]. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 30 (NIPS 2017), 2017, 30
[10] Off-Policy Proximal Policy Optimization
Meng, Wenjia
Zheng, Qian
Pan, Gang
Yin, Yilong
[J]. THIRTY-SEVENTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 37 NO 8, 2023, : 9162 - 9170
