Beyond the Policy Gradient Theorem for Efficient Policy Updates in Actor-Critic Algorithms

Cited by: 0
Authors: Laroche, Romain [1]; des Combes, Remi Tachet [1]
Affiliations: [1] Microsoft Research Montreal, Montreal, PQ, Canada
DOI: not available
CLC classification: TP18 [Artificial Intelligence Theory]
Discipline codes: 081104; 0812; 0835; 1405
Abstract
In Reinforcement Learning, the optimal action at a given state is dependent on policy decisions at subsequent states. As a consequence, the learning targets evolve with time and the policy optimization process must be efficient at unlearning what it previously learnt. In this paper, we discover that the policy gradient theorem prescribes policy updates that are slow to unlearn because of their structural symmetry with respect to the value target. To increase the unlearning speed, we study a novel policy update: the gradient of the cross-entropy loss with respect to the action maximizing q, but find that such updates may lead to a decrease in value. Consequently, we introduce a modified policy update devoid of that flaw, and prove its guarantees of convergence to global optimality in O(t^{-1}) under classic assumptions. Further, we assess standard policy updates and our cross-entropy policy updates along six analytical dimensions. Finally, we empirically validate our theoretical findings.
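The abstract contrasts the standard policy-gradient update with a cross-entropy update toward the action maximizing q. Below is a minimal sketch of that contrast for a tabular softmax policy at a single state; the function names and the NumPy setup are illustrative assumptions, not code from the paper, and the paper's final modified update (the one proved free of the value-decrease flaw) is not reproduced here.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over action logits."""
    z = logits - logits.max()
    p = np.exp(z)
    return p / p.sum()

def policy_gradient_update(theta_s, q_s, lr=0.1):
    """Standard policy-gradient step for a tabular softmax policy at one state:
    ascend sum_a pi(a|s) * q(s,a) * grad log pi(a|s) = pi * (q - pi.q)."""
    pi = softmax(theta_s)
    v = pi @ q_s                      # state value under the current policy (baseline)
    return theta_s + lr * pi * (q_s - v)

def cross_entropy_update(theta_s, q_s, lr=0.1):
    """Cross-entropy-style step toward the greedy action a* = argmax_a q(s,a):
    descend -log pi(a*|s); for a softmax policy the ascent direction is one_hot(a*) - pi."""
    pi = softmax(theta_s)
    direction = -pi
    direction[np.argmax(q_s)] += 1.0  # one_hot(a*) - pi
    return theta_s + lr * direction

# Toy illustration of unlearning: the policy initially favours action 0,
# but the critic now estimates action 2 as best.
theta = np.array([3.0, 0.0, 0.0])
q = np.array([0.0, 0.0, 1.0])
print(softmax(policy_gradient_update(theta, q)))  # mass moves off action 0 slowly
print(softmax(cross_entropy_update(theta, q)))    # mass moves toward action 2 more directly
```

In this sketch, the policy-gradient step on the greedy action is scaled by that action's current (small) probability, whereas the cross-entropy step is not, which is one informal way to see why the latter can unlearn faster; the paper's own analysis and its corrected update should be consulted for the precise statement.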
Pages: 5658-5688
Page count: 31
Related Papers (50 total)
• [1] Ghavamzadeh, Mohammad; Engel, Yaakov; Valko, Michal. Bayesian Policy Gradient and Actor-Critic Algorithms. Journal of Machine Learning Research, 2016, 17.
• [2] Awate, Yogesh P. Policy-Gradient Based Actor-Critic Algorithms. Proceedings of the 2009 WRI Global Congress on Intelligent Systems, Vol. III, 2009: 505-509.
• [3] Li, Luntong; Zhu, Yuanheng. Boosting On-Policy Actor-Critic With Shallow Updates in Critic. IEEE Transactions on Neural Networks and Learning Systems, 2024.
• [4] Jia, Yanwei; Zhou, Xun Yu. Policy Gradient and Actor-Critic Learning in Continuous Time and Space: Theory and Algorithms. Journal of Machine Learning Research, 2022, 23.
• [5] Awate, Yogesh P. Algorithms for Variance Reduction in a Policy-Gradient Based Actor-Critic Framework. ADPRL: 2009 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning, 2009: 130-136.
• [6] Wen, Junfeng; Kumar, Saurabh; Gummadi, Ramki; Schuurmans, Dale. Characterizing the Gap Between Actor-Critic and Policy Gradient. International Conference on Machine Learning, Vol. 139, 2021.
• [7] Derman, Esther; Mankowitz, Daniel J.; Mann, Timothy A.; Mannor, Shie. Soft-Robust Actor-Critic Policy-Gradient. Uncertainty in Artificial Intelligence, 2018: 208-218.
• [8] Actor-critic algorithm with incremental dual natural policy gradient. Journal on Communications, 2017, 38.
• [9] Konda, V. R.; Tsitsiklis, J. N. Actor-critic algorithms. Advances in Neural Information Processing Systems 12, 2000, 12: 1008-1014.
• [10] Konda, V. R.; Tsitsiklis, J. N. On actor-critic algorithms. SIAM Journal on Control and Optimization, 2003, 42(04): 1143-1166.