Off-Policy Actor-Critic with Emphatic Weightings

Cited by: 0
Authors
Graves, Eric [1 ]
Imani, Ehsan [1 ]
Kumaraswamy, Raksha [1 ]
White, Martha [1 ]
Affiliations
[1] Univ Alberta, Dept Comp Sci, Reinforcement Learning & Artificial Intelligence Lab, Edmonton, AB T6G 2E8, Canada
Funding
Natural Sciences and Engineering Research Council of Canada (NSERC)
Keywords
off-policy learning; policy gradient; actor-critic; reinforcement learning
DOI
Not available
Chinese Library Classification
TP [Automation Technology; Computer Technology]
Discipline Code
0812
Abstract
A variety of theoretically sound policy gradient algorithms exist for the on-policy setting due to the policy gradient theorem, which provides a simplified form for the gradient. The off-policy setting, however, has been less clear due to the existence of multiple objectives and the lack of an explicit off-policy policy gradient theorem. In this work, we unify these objectives into one off-policy objective, and provide a policy gradient theorem for this unified objective. The derivation involves emphatic weightings and interest functions. We show multiple strategies to approximate the gradients, in an algorithm called Actor-Critic with Emphatic weightings (ACE). We prove via a counterexample that previous (semi-gradient) off-policy actor-critic methods, particularly Off-Policy Actor-Critic (OffPAC) and Deterministic Policy Gradient (DPG), converge to the wrong solution, whereas ACE finds the optimal solution. We also highlight why these semi-gradient approaches can still perform well in practice, suggesting strategies for variance reduction in ACE. We empirically study several variants of ACE on two classic control environments and an image-based environment designed to illustrate the tradeoffs made by each gradient approximation. We find that by approximating the emphatic weightings directly, ACE performs as well as or better than OffPAC in all settings tested.
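Although this entry gives only the abstract, the shape of the result can be sketched. Below is a hedged reconstruction of the unified (excursion) objective and the emphatic policy gradient theorem, following the notation the authors use in closely related work; the symbols d_mu (behaviour policy's state distribution), i (interest function), m (emphatic weighting), and P (transition kernel) are assumptions of this sketch, and the exact statement in the paper should be treated as authoritative:

```latex
% Hedged sketch, not a verbatim statement from the paper.
% Unified off-policy (excursion) objective, weighted by an interest function i:
J_\mu(\theta) = \sum_{s} d_\mu(s)\, i(s)\, v_{\pi_\theta}(s)

% Off-policy policy gradient theorem in terms of emphatic weightings m(s):
\nabla_\theta J_\mu(\theta) = \sum_{s} m(s) \sum_{a} \nabla_\theta \pi_\theta(a \mid s)\, q_{\pi_\theta}(s, a)

% where m is defined recursively through the target policy's discounted state flow:
m(s') = d_\mu(s')\, i(s') + \gamma \sum_{s} m(s) \sum_{a} \pi_\theta(a \mid s)\, P(s' \mid s, a)
```

Semi-gradient methods such as OffPAC effectively replace m(s) with d_mu(s) i(s), dropping the recursive flow term; the counterexample described in the abstract turns on exactly this substitution.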
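The abstract's "strategies to approximate the gradients" can also be illustrated. The sketch below is a minimal, hypothetical incremental actor update in Python that approximates the emphatic weighting online with a follow-on trace; the function name, argument names, and defaults are illustrative assumptions, not the authors' reference implementation:

```python
import numpy as np

def ace_actor_update(theta, grad_log_pi, delta, rho, rho_prev, F_prev,
                     interest=1.0, gamma=0.99, lambda_a=1.0, alpha=0.001):
    """One ACE-style actor step (hedged sketch, not the paper's code).

    theta        : policy parameters (np.ndarray)
    grad_log_pi  : gradient of log pi(A_t | S_t; theta) w.r.t. theta
    delta        : TD error from the critic, R_{t+1} + gamma * v(S_{t+1}) - v(S_t)
    rho          : importance sampling ratio pi(A_t | S_t) / mu(A_t | S_t)
    rho_prev     : previous step's importance sampling ratio
    F_prev       : previous value of the follow-on trace
    interest     : interest i(S_t) assigned to the current state
    lambda_a     : 0 gives a semi-gradient update, 1 the full emphatic update
    """
    # Follow-on trace: accumulates discounted, importance-corrected interest,
    # approximating the emphatic weighting of the current state online.
    F = gamma * rho_prev * F_prev + interest

    # Emphasis: interpolates between the interest alone (semi-gradient)
    # and the full follow-on trace (emphatic weighting).
    M = (1.0 - lambda_a) * interest + lambda_a * F

    # Emphatically weighted off-policy policy gradient step.
    theta = theta + alpha * rho * M * delta * grad_log_pi
    return theta, F
```

Setting lambda_a = 0 recovers an OffPAC-style semi-gradient update, while lambda_a = 1 uses the full emphatic weighting; intermediate values trade the bias of the semi-gradient against the variance introduced by the product of importance sampling ratios accumulated inside F.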
Pages: 63
Related Papers
50 records in total
  • [11] Neural Network Compatible Off-Policy Natural Actor-Critic Algorithm
    Diddigi, Raghuram Bharadwaj
    Jain, Prateek
    Prabuchandran, K. J.
    Bhatnagar, Shalabh
2022 International Joint Conference on Neural Networks (IJCNN), 2022.
  • [12] An Optimistic Approach to the Temporal Difference Error in Off-Policy Actor-Critic Algorithms
    Saglam, Baturay
    Mutlu, Furkan B.
    Kozat, Suleyman S.
2022 IEEE Symposium Series on Computational Intelligence (SSCI), 2022: 875-883.
  • [13] Finite-Sample Analysis of Off-Policy Natural Actor-Critic Algorithm
    Khodadadian, Sajad
    Chen, Zaiwei
    Maguluri, Siva Theja
International Conference on Machine Learning, Vol. 139, 2021.
  • [14] Off-Policy Actor-Critic Structure for Optimal Control of Unknown Systems With Disturbances
    Song, Ruizhuo
    Lewis, Frank L.
    Wei, Qinglai
    Zhang, Huaguang
IEEE Transactions on Cybernetics, 2016, 46(5): 1041-1050.
  • [15] Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor
    Haarnoja, Tuomas
    Zhou, Aurick
    Abbeel, Pieter
    Levine, Sergey
International Conference on Machine Learning, Vol. 80, 2018.
  • [16] Improved Soft Actor-Critic: Mixing Prioritized Off-Policy Samples With On-Policy Experiences
    Banerjee, Chayan
    Chen, Zhiyong
    Noman, Nasimul
IEEE Transactions on Neural Networks and Learning Systems, 2024, 35(3): 3121-3129.
  • [17] An Improved Off-Policy Actor-Critic Algorithm with Historical Behaviors Reusing for Robotic Control
    Zhang, Huaqing
    Ma, Hongbin
    Jin, Ying
Intelligent Robotics and Applications (ICIRA 2022), Part IV, 2022, 13458: 449-458.
  • [18] A Multi-Agent Off-Policy Actor-Critic Algorithm for Distributed Reinforcement Learning
    Suttle, Wesley
    Yang, Zhuoran
    Zhang, Kaiqing
    Wang, Zhaoran
    Basar, Tamer
    Liu, Ji
IFAC-PapersOnLine, 2020, 53(2): 1549-1554.
  • [19] Finite-Sample Analysis of Off-Policy Natural Actor-Critic With Linear Function Approximation
    Chen, Zaiwei
    Khodadadian, Sajad
    Maguluri, Siva Theja
IEEE Control Systems Letters, 2022, 6: 2611-2616.
  • [20] Distributional Soft Actor-Critic: Off-Policy Reinforcement Learning for Addressing Value Estimation Errors
    Duan, Jingliang
    Guan, Yang
    Li, Shengbo Eben
    Ren, Yangang
    Sun, Qi
    Cheng, Bo
IEEE Transactions on Neural Networks and Learning Systems, 2022, 33(11): 6584-6598.