Beyond the Policy Gradient Theorem for Efficient Policy Updates in Actor-Critic Algorithms

Cited by: 0
Authors: Laroche, Romain [1]; des Combes, Remi Tachet [1]
Affiliations: [1] Microsoft Research Montreal, Montreal, PQ, Canada
DOI: not available
CLC classification: TP18 [Artificial Intelligence Theory]
Discipline codes: 081104; 0812; 0835; 1405
Abstract
In Reinforcement Learning, the optimal action at a given state is dependent on policy decisions at subsequent states. As a consequence, the learning targets evolve with time and the policy optimization process must be efficient at unlearning what it previously learnt. In this paper, we discover that the policy gradient theorem prescribes policy updates that are slow to unlearn because of their structural symmetry with respect to the value target. To increase the unlearning speed, we study a novel policy update: the gradient of the cross-entropy loss with respect to the action maximizing q, but find that such updates may lead to a decrease in value. Consequently, we introduce a modified policy update devoid of that flaw, and prove its guarantees of convergence to global optimality in O(t^{-1}) under classic assumptions. Further, we assess standard policy updates and our cross-entropy policy updates along six analytical dimensions. Finally, we empirically validate our theoretical findings.
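The abstract contrasts the standard policy-gradient update with a cross-entropy update toward the action maximizing q. Below is a minimal sketch of that contrast for a tabular softmax policy at a single state; the function names and the NumPy setup are illustrative assumptions, not code from the paper, and the paper's final modified update (the one proved free of the value-decrease flaw) is not reproduced here.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over action logits."""
    z = logits - logits.max()
    p = np.exp(z)
    return p / p.sum()

def policy_gradient_update(theta_s, q_s, lr=0.1):
    """Standard policy-gradient step for a tabular softmax policy at one state:
    ascend sum_a pi(a|s) * q(s,a) * grad log pi(a|s) = pi * (q - pi.q)."""
    pi = softmax(theta_s)
    v = pi @ q_s                      # state value under the current policy (baseline)
    return theta_s + lr * pi * (q_s - v)

def cross_entropy_update(theta_s, q_s, lr=0.1):
    """Cross-entropy-style step toward the greedy action a* = argmax_a q(s,a):
    descend -log pi(a*|s); for a softmax policy the ascent direction is one_hot(a*) - pi."""
    pi = softmax(theta_s)
    direction = -pi
    direction[np.argmax(q_s)] += 1.0  # one_hot(a*) - pi
    return theta_s + lr * direction

# Toy illustration of unlearning: the policy initially favours action 0,
# but the critic now estimates action 2 as best.
theta = np.array([3.0, 0.0, 0.0])
q = np.array([0.0, 0.0, 1.0])
print(softmax(policy_gradient_update(theta, q)))  # mass moves off action 0 slowly
print(softmax(cross_entropy_update(theta, q)))    # mass moves toward action 2 more directly
```

In this sketch, the policy-gradient step on the greedy action is scaled by that action's current (small) probability, whereas the cross-entropy step is not, which is one informal way to see why the latter can unlearn faster; the paper's own analysis and its corrected update should be consulted for the precise statement.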
Pages: 5658-5688
Page count: 31
Related Papers (50 total)
• [1] Ghavamzadeh, Mohammad; Engel, Yaakov; Valko, Michal. Bayesian Policy Gradient and Actor-Critic Algorithms. Journal of Machine Learning Research, 2016, 17.
• [2] Awate, Yogesh P. Policy-Gradient Based Actor-Critic Algorithms. Proceedings of the 2009 WRI Global Congress on Intelligent Systems, Vol. III, 2009: 505-509.
• [3] Li, Luntong; Zhu, Yuanheng. Boosting On-Policy Actor-Critic With Shallow Updates in Critic. IEEE Transactions on Neural Networks and Learning Systems, 2024.
• [4] Jia, Yanwei; Zhou, Xun Yu. Policy Gradient and Actor-Critic Learning in Continuous Time and Space: Theory and Algorithms. Journal of Machine Learning Research, 2022, 23.
• [5] Awate, Yogesh P. Algorithms for Variance Reduction in a Policy-Gradient Based Actor-Critic Framework. ADPRL: 2009 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning, 2009: 130-136.
• [6] Wen, Junfeng; Kumar, Saurabh; Gummadi, Ramki; Schuurmans, Dale. Characterizing the Gap Between Actor-Critic and Policy Gradient. International Conference on Machine Learning, Vol. 139, 2021.
• [7] Derman, Esther; Mankowitz, Daniel J.; Mann, Timothy A.; Mannor, Shie. Soft-Robust Actor-Critic Policy-Gradient. Uncertainty in Artificial Intelligence, 2018: 208-218.
• [8] Actor-critic algorithm with incremental dual natural policy gradient. Journal on Communications, 2017, 38.
• [9] Konda, V. R.; Tsitsiklis, J. N. Actor-critic algorithms. Advances in Neural Information Processing Systems 12, 2000, 12: 1008-1014.
• [10] Konda, V. R.; Tsitsiklis, J. N. On actor-critic algorithms. SIAM Journal on Control and Optimization, 2003, 42(04): 1143-1166.