Off-Policy Actor-Critic with Emphatic Weightings

Cited by: 0

Authors:

Graves, Eric [1]

Imani, Ehsan [1]

Kumaraswamy, Raksha [1]

White, Martha [1]

Affiliations:

[1] Univ Alberta, Dept Comp Sci, Reinforcement Learning & Artificial Intelligence L, Edmonton, AB T6G 2E8, Canada

Source:

JOURNAL OF MACHINE LEARNING RESEARCH | 2023 / Vol. 24

Funding:

Natural Sciences and Engineering Research Council of Canada (NSERC);

Keywords:

off-policy learning; policy gradient; actor-critic; reinforcement learning;

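DOI:

Not available

Chinese Library Classification (CLC) number:

TP [Automation Technology, Computer Technology];

Discipline classification code:

0812 ;

Abstract: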

A variety of theoretically-sound policy gradient algorithms exist for the on-policy setting due to the policy gradient theorem, which provides a simplified form for the gradient. The off-policy setting, however, has been less clear due to the existence of multiple objectives and the lack of an explicit off-policy policy gradient theorem. In this work, we unify these objectives into one off-policy objective, and provide a policy gradient theorem for this unified objective. The derivation involves emphatic weightings and interest functions. We show multiple strategies to approximate the gradients, in an algorithm called Actor Critic with Emphatic weightings (ACE). We prove in a counterexample that previous (semi-gradient) off-policy actor-critic methods, particularly Off-Policy Actor-Critic (OffPAC) and Deterministic Policy Gradient (DPG), converge to the wrong solution whereas ACE finds the optimal solution. We also highlight why these semi-gradient approaches can still perform well in practice, suggesting strategies for variance reduction in ACE. We empirically study several variants of ACE on two classic control environments and an image-based environment designed to illustrate the tradeoffs made by each gradient approximation. We find that by approximating the emphatic weightings directly, ACE performs as well as or better than OffPAC in all settings tested.
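The abstract does not spell out how an emphatic weighting is computed. As a rough illustration only, the sketch below uses a commonly stated follow-on trace form, F_t = gamma * rho_{t-1} * F_{t-1} + i(S_t) and M_t = (1 - eta) * i(S_t) + eta * F_t, which is an assumption here rather than something taken from this record. The function name `emphatic_weights` and the parameters `rhos`, `interests`, `gamma`, and `eta` are hypothetical.

```python
def emphatic_weights(rhos, interests, gamma, eta):
    """Sketch of per-step emphatic weightings M_t along one trajectory.

    Assumed (not from the abstract) recursion:
        F_t = gamma * rho_{t-1} * F_{t-1} + i(S_t)
        M_t = (1 - eta) * i(S_t) + eta * F_t
    rhos[t]      : importance ratio pi(A_t|S_t) / mu(A_t|S_t)
    interests[t] : interest i(S_t) in state S_t
    eta          : 0 ignores the trace (semi-gradient style weighting),
                   1 uses the full follow-on trace.
    """
    followon = 0.0   # F_{-1} = 0, so F_0 equals the first interest
    prev_rho = 1.0   # irrelevant at t = 0 because followon starts at 0
    weights = []
    for rho, interest in zip(rhos, interests):
        followon = gamma * prev_rho * followon + interest
        weights.append((1.0 - eta) * interest + eta * followon)
        prev_rho = rho
    return weights


# Hypothetical three-step trajectory with uniform interest.
semi = emphatic_weights([2.0, 1.0, 1.0], [1.0, 1.0, 1.0], gamma=0.5, eta=0.0)
full = emphatic_weights([2.0, 1.0, 1.0], [1.0, 1.0, 1.0], gamma=0.5, eta=1.0)
```

Under this assumed form, each weight would scale the actor's per-step update. Setting `eta` to 0 reduces the weight to the interest alone, which is one way to picture the contrast the abstract draws between semi-gradient methods and ACE.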


Pages: 63
