An Off-policy Policy Gradient Theorem Using Emphatic Weightings

Cited by: 0
Authors
Imani, Ehsan [1]
Graves, Eric [1]
White, Martha [1]
Affiliation
[1] Univ Alberta, Reinforcement Learning & Artificial Intelligence, Dept Comp Sci, Edmonton, AB, Canada
Keywords
DOI
Not available
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Policy gradient methods are widely used for control in reinforcement learning, particularly in the continuous-action setting. A host of theoretically sound algorithms have been proposed for the on-policy setting, owing to the policy gradient theorem, which provides a simplified form for the gradient. In off-policy learning, however, where the behaviour policy is not necessarily attempting to learn and follow the optimal policy for the given task, the existence of such a theorem has been elusive. In this work, we solve this open problem by providing the first off-policy policy gradient theorem. The key to the derivation is the use of emphatic weightings. We develop a new actor-critic algorithm, called Actor-Critic with Emphatic weightings (ACE), that approximates the simplified gradients provided by the theorem. We demonstrate on a simple counterexample that previous off-policy policy gradient methods, particularly OffPAC and DPG, converge to the wrong solution, whereas ACE finds the optimal solution.
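A minimal mathematical sketch of the result the abstract refers to, using standard MDP notation assumed here rather than taken from this record (behaviour policy \mu with stationary distribution d_\mu, parameterised target policy \pi_\theta, discount \gamma, state and action value functions v and q):

\[
J_\mu(\theta) = \sum_{s} d_\mu(s)\, v_{\pi_\theta}(s),
\qquad
\nabla_\theta J_\mu(\theta) = \sum_{s} m(s) \sum_{a} \nabla_\theta \pi_\theta(a \mid s)\, q_{\pi_\theta}(s, a),
\]
\[
\text{where } \mathbf{m}^{\top} = \mathbf{d}_\mu^{\top} \bigl( \mathbf{I} - \mathbf{P}_{\pi_\theta,\gamma} \bigr)^{-1},
\qquad
\mathbf{P}_{\pi_\theta,\gamma}(s, s') = \gamma \sum_{a} \pi_\theta(a \mid s)\, P(s' \mid s, a).
\]

The emphatic weighting m, rather than the behaviour distribution d_\mu, weights the states in the gradient. This sketch is for orientation only; the precise statement and its conditions are in the paper itself (and in the JMLR extension listed under Related Papers below).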
Pages: 11
Related Papers
50 records in total
  • [1] Cao, Jiaqing; Liu, Quan; Zhu, Fei; Fu, Qiming; Zhong, Shan. Gradient temporal-difference learning for off-policy evaluation using emphatic weightings. Information Sciences, 2021, 580: 311-330.
  • [2] Graves, Eric; Imani, Ehsan; Kumaraswamy, Raksha; White, Martha. Off-Policy Actor-Critic with Emphatic Weightings. Journal of Machine Learning Research, 2023, 24.
  • [3] Tosatto, Samuele; Carvalho, Joao; Abdulsamad, Hany; Peters, Jan. A Nonparametric Off-Policy Policy Gradient. International Conference on Artificial Intelligence and Statistics, 2020, 108.
  • [4] Liu, Yao; Swaminathan, Adith; Agarwal, Alekh; Brunskill, Emma. Off-Policy Policy Gradient with State Distribution Correction. 35th Uncertainty in Artificial Intelligence Conference (UAI 2019), 2020, 115: 1180-1190.
  • [5] Sutton, Richard S.; Mahmood, A. Rupam; White, Martha. An Emphatic Approach to the Problem of Off-policy Temporal-Difference Learning. Journal of Machine Learning Research, 2016, 17.
  • [6] Tosatto, Samuele; Carvalho, Joao; Peters, Jan. Batch Reinforcement Learning With a Nonparametric Off-Policy Policy Gradient. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022, 44(10): 5996-6010.
  • [7] Irpan, Alex; Rao, Kanishka; Bousmalis, Konstantinos; Harris, Chris; Ibarz, Julian; Levine, Sergey. Off-Policy Evaluation via Off-Policy Classification. Advances in Neural Information Processing Systems 32 (NIPS 2019), 2019, 32.
  • [8] Nakamura, Yutaka; Mori, Takeshi; Tokita, Yoichi; Shibata, Tomohiro; Ishii, Shin. Off-Policy Natural Policy Gradient Method for a Biped Walking Using a CPG Controller. Journal of Robotics and Mechatronics, 2005, 17(06): 636-644.
  • [9] Gu, Shixiang; Lillicrap, Timothy; Ghahramani, Zoubin; Turner, Richard E.; Scholkopf, Bernhard; Levine, Sergey. Interpolated Policy Gradient: Merging On-Policy and Off-Policy Gradient Estimation for Deep Reinforcement Learning. Advances in Neural Information Processing Systems 30 (NIPS 2017), 2017, 30.
  • [10] Meng, Wenjia; Zheng, Qian; Pan, Gang; Yin, Yilong. Off-Policy Proximal Policy Optimization. Thirty-Seventh AAAI Conference on Artificial Intelligence, Vol 37 No 8, 2023: 9162-9170.