Temporal-difference emphasis learning with regularized correction for off-policy evaluation and control

Cited by: 0
Authors
Cao, Jiaqing [1 ]
Liu, Quan [1 ]
Wu, Lan [1 ]
Fu, Qiming [2 ]
Zhong, Shan [3 ]
Affiliations
[1] Soochow Univ, Sch Comp Sci & Technol, Suzhou 215006, Peoples R China
[2] Suzhou Univ Sci & Technol, Sch Elect & Informat Engn, Suzhou 215009, Peoples R China
[3] Changshu Inst Technol, Sch Comp Sci & Engn, Changshu 215500, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Reinforcement learning; Off-policy learning; Emphatic approach; Gradient temporal-difference learning; Gradient emphasis learning;
DOI
10.1007/s10489-023-04579-4
CLC Number
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Off-policy learning, in which the goal is to learn about a policy of interest while following a different behavior policy, constitutes an important class of reinforcement learning problems. Emphatic temporal-difference (TD) learning is a pioneering off-policy reinforcement learning method that reweights updates by means of the followon trace. The gradient emphasis learning (GEM) algorithm was recently proposed, from the perspective of stochastic approximation, to fix the unbounded variance and large emphasis approximation error introduced by the followon trace. This approach, however, is limited to a single GTD2-style update and does not consider the update rules of the other gradient-TD (GTD) algorithms. Overall, how to better learn the emphasis for off-policy learning remains an open question. In this paper, we rethink GEM and introduce a novel two-time-scale algorithm, TD emphasis learning with gradient correction (TDEC), to learn the true emphasis. We further regularize the update to the secondary learning process of TDEC, obtaining our final algorithm, TD emphasis learning with regularized correction (TDERC). We then apply the emphasis estimated by the proposed emphasis learning algorithms to the value estimation gradient and the policy gradient, respectively, yielding the corresponding emphatic TD variants for off-policy evaluation and actor-critic algorithms for off-policy control. Finally, we empirically demonstrate the advantage of the proposed algorithms on a small domain as well as on challenging MuJoCo robot simulation tasks. Taken together, we hope that our work provides new insights into the development of a better alternative in the family of off-policy emphatic algorithms.
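For intuition, the following minimal sketch (Python/NumPy) illustrates what a two-time-scale, gradient-corrected emphasis update with a regularized secondary step could look like under linear function approximation. It is an illustration only, assuming linear features x(s), an interest signal i(s), the previous-step importance-sampling ratio rho_{t-1}, and a regularization coefficient beta on the secondary weights; all names are ours, and the exact TDEC/TDERC update rules are the ones given in the paper.

import numpy as np

# Sketch only (not the authors' exact updates): a TDC/TDRC-style
# two-time-scale rule for an emphasis approximation m(s) ~ w^T x(s)
# with linear features, driven by the time-reversed recursion behind
# the followon trace F_t = i(S_t) + gamma * rho_{t-1} * F_{t-1}.
class EmphasisLearnerSketch:
    def __init__(self, n_features, alpha=0.01, alpha_h=0.01, beta=1.0, gamma=0.99):
        self.w = np.zeros(n_features)   # primary weights: emphasis estimate
        self.h = np.zeros(n_features)   # secondary weights: correction term
        self.alpha, self.alpha_h = alpha, alpha_h
        self.beta = beta                # regularization on h; beta = 0 recovers the unregularized correction
        self.gamma = gamma

    def update(self, x_prev, x, rho_prev, interest):
        # Emphasis "TD error": the bootstrap target uses the previous state,
        # mirroring the forward accumulation of the followon trace.
        delta = interest + self.gamma * rho_prev * (self.w @ x_prev) - self.w @ x
        # Slow time scale: semi-gradient step plus a gradient-correction term along x_prev.
        self.w += self.alpha * (delta * x - self.gamma * rho_prev * (self.h @ x) * x_prev)
        # Fast time scale: track the expected TD error; the -beta * h term regularizes the correction.
        self.h += self.alpha_h * ((delta - self.h @ x) * x - self.beta * self.h)
        return float(self.w @ x)        # current emphasis estimate for this state

The returned emphasis estimate would then scale the downstream off-policy TD update of the value weights (and, analogously, the critic update in an off-policy actor-critic), in the spirit of the emphatic variants described in the abstract.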
Pages: 20917-20937
Page count: 21
Related Papers (50 total)
  • [1] Temporal-difference emphasis learning with regularized correction for off-policy evaluation and control
    Cao, Jiaqing
    Liu, Quan
    Wu, Lan
    Fu, Qiming
    Zhong, Shan
    APPLIED INTELLIGENCE, 2023, 53 : 20917 - 20937
  • [2] Gradient temporal-difference learning for off-policy evaluation using emphatic weightings
    Cao, Jiaqing
    Liu, Quan
    Zhu, Fei
    Fu, Qiming
    Zhong, Shan
    INFORMATION SCIENCES, 2021, 580 : 311 - 330
  • [3] An Emphatic Approach to the Problem of Off-policy Temporal-Difference Learning
    Sutton, Richard S.
    Mahmood, A. Rupam
    White, Martha
    JOURNAL OF MACHINE LEARNING RESEARCH, 2016, 17
  • [4] Distributed Consensus-Based Multi-Agent Off-Policy Temporal-Difference Learning
    Stankovic, Milos S.
    Beko, Marko
    Stankovic, Srdjan S.
    2021 60TH IEEE CONFERENCE ON DECISION AND CONTROL (CDC), 2021, : 5976 - 5981
  • [5] Modified Retrace for Off-Policy Temporal Difference Learning
    Chen, Xingguo
    Ma, Xingzhou
    Li, Yang
    Yang, Guang
    Yang, Shangdong
    Gao, Yang
    UNCERTAINTY IN ARTIFICIAL INTELLIGENCE, 2023, 216 : 303 - 312
  • [6] Off-Policy Temporal Difference Learning with Bellman Residuals
    Yang, Shangdong
    Sun, Dingyuanhao
    Chen, Xingguo
    MATHEMATICS, 2024, 12 (22)
  • [7] Gradient Temporal-Difference Learning with Regularized Corrections
    Ghiassian, Sina
    Patterson, Andrew
    Garg, Shivam
    Gupta, Dhawal
    White, Adam
    White, Martha
    INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 119, 2020, 119
  • [8] Gradient Temporal-Difference Learning with Regularized Corrections
    Ghiassian, Sina
    Patterson, Andrew
    Garg, Shivam
    Gupta, Dhawal
    White, Adam
    White, Martha
    25TH AMERICAS CONFERENCE ON INFORMATION SYSTEMS (AMCIS 2019), 2019,
  • [9] Generalized gradient emphasis learning for off-policy evaluation and control with function approximation
    Cao, Jiaqing
    Liu, Quan
    Wu, Lan
    Fu, Qiming
    Zhong, Shan
    NEURAL COMPUTING AND APPLICATIONS, 2023, 35 : 23599 - 23616
  • [10] Two Time-Scale Stochastic Approximation with Controlled Markov Noise and Off-Policy Temporal-Difference Learning
    Karmakar, Prasenjit
    Bhatnagar, Shalabh
    MATHEMATICS OF OPERATIONS RESEARCH, 2018, 43 (01) : 130 - 151