Temporal-difference emphasis learning with regularized correction for off-policy evaluation and control

Times Cited: 0
Authors
Cao, Jiaqing [1 ]
Liu, Quan [1 ]
Wu, Lan [1 ]
Fu, Qiming [2 ]
Zhong, Shan [3 ]
Affiliations
[1] Soochow Univ, Sch Comp Sci & Technol, Suzhou 215006, Peoples R China
[2] Suzhou Univ Sci & Technol, Sch Elect & Informat Engn, Suzhou 215009, Peoples R China
[3] Changshu Inst Technol, Sch Comp Sci & Engn, Changshu 215500, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Reinforcement learning; Off-policy learning; Emphatic approach; Gradient temporal-difference learning; Gradient emphasis learning;
DOI
10.1007/s10489-023-04579-4
CLC Number
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Off-policy learning, where the goal is to learn about a policy of interest while following a different behavior policy, constitutes an important class of reinforcement learning problems. Emphatic temporal-difference (TD) learning is a pioneering off-policy reinforcement learning method that relies on the followon trace. The gradient emphasis learning (GEM) algorithm has recently been proposed, from the perspective of stochastic approximation, to fix the unbounded variance and large emphasis approximation error introduced by the followon trace. This approach, however, is limited to a single GTD2-style update and does not consider the update rules of other gradient-TD (GTD) algorithms. Overall, how to better learn the emphasis for off-policy learning remains an open question. In this paper, we rethink GEM and introduce a novel two-time-scale algorithm, TD emphasis learning with gradient correction (TDEC), to learn the true emphasis. We then regularize the update of the secondary learner in TDEC, obtaining our final algorithm, TD emphasis learning with regularized correction (TDERC). We apply the emphasis estimated by the proposed emphasis learning algorithms to the value estimation gradient and the policy gradient, yielding the corresponding emphatic TD variants for off-policy evaluation and actor-critic algorithms for off-policy control. Finally, we empirically demonstrate the advantage of the proposed algorithms on a small domain as well as on challenging MuJoCo robot simulation tasks. Taken together, we hope that our work provides new insight into the development of a better alternative within the family of off-policy emphatic algorithms.
Pages: 20917-20937
Number of pages: 21
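The abstract describes TDERC as a two-time-scale update in which the secondary (correction) learner is regularized. As a rough orientation only, the sketch below shows the generic TDRC-style pattern of such an update (gradient correction applied to the primary weights, plus an L2 penalty on the secondary weights); it is not the paper's exact TDEC/TDERC emphasis update, and the step-size and regularization names (`alpha`, `beta`, `rho`) are assumptions.

```python
import numpy as np

def regularized_correction_update(w, h, x, x_next, reward, rho,
                                  gamma=0.99, alpha=0.01, beta=1.0):
    """One two-time-scale TD update with a regularized correction term.

    Illustrative only: this follows a generic TDRC-style pattern, not the
    paper's TDEC/TDERC emphasis update. alpha (step size), beta
    (regularization strength) and rho (importance sampling ratio) are
    assumed names.
    """
    delta = reward + gamma * w.dot(x_next) - w.dot(x)   # TD error
    correction = gamma * h.dot(x) * x_next               # gradient-correction term
    w_new = w + alpha * rho * (delta * x - correction)   # primary weights
    # Secondary weights: TDC-style update plus an extra -beta*h term,
    # i.e. the "regularized correction" applied to the second learner.
    h_new = h + alpha * (rho * (delta - h.dot(x)) * x - beta * h)
    return w_new, h_new


# Minimal usage with random features (purely illustrative data):
d = 8
w, h = np.zeros(d), np.zeros(d)
rng = np.random.default_rng(0)
for _ in range(100):
    x, x_next = rng.normal(size=d), rng.normal(size=d)
    w, h = regularized_correction_update(w, h, x, x_next,
                                         reward=rng.normal(), rho=1.0)
```

The `-beta * h` term simply shrinks the secondary estimator toward zero on each step; in TDRC-style methods this trades a small amount of bias for a more stable second learner, which appears to be the spirit of the "regularized correction" named in the title.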