Temporal-difference emphasis learning with regularized correction for off-policy evaluation and control

Citations: 0
Authors
Cao, Jiaqing [1 ]
Liu, Quan [1 ]
Wu, Lan [1 ]
Fu, Qiming [2 ]
Zhong, Shan [3 ]
Affiliations
[1] Soochow Univ, Sch Comp Sci & Technol, Suzhou 215006, Peoples R China
[2] Suzhou Univ Sci & Technol, Sch Elect & Informat Engn, Suzhou 215009, Peoples R China
[3] Changshu Inst Technol, Sch Comp Sci & Engn, Changshu 215500, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Reinforcement learning; Off-policy learning; Emphatic approach; Gradient temporal-difference learning; Gradient emphasis learning;
DOI
10.1007/s10489-023-04579-4
CLC (Chinese Library Classification) Number
TP18 [Theory of Artificial Intelligence];
Discipline Classification Codes
081104; 0812; 0835; 1405;
Abstract
Off-policy learning, in which the goal is to learn about a policy of interest while following a different behavior policy, constitutes an important class of reinforcement learning problems. Emphatic temporal-difference (TD) learning is a pioneering off-policy reinforcement learning method built on the followon trace. The gradient emphasis learning (GEM) algorithm was recently proposed, from the perspective of stochastic approximation, to fix the unbounded variance and the large emphasis approximation error introduced by the followon trace. That approach, however, is limited to a single GTD2-style update and does not consider the update rules of the other gradient-TD (GTD) algorithms, so how to better learn the emphasis for off-policy learning remains an open question. In this paper, we rethink GEM and introduce a novel two-time-scale algorithm, TD emphasis learning with gradient correction (TDEC), to learn the true emphasis. We further regularize the update to the secondary learning process of TDEC, obtaining our final algorithm, TD emphasis learning with regularized correction (TDERC). We then apply the emphasis estimated by the proposed emphasis learning algorithms to the value estimation gradient and the policy gradient, respectively, yielding the corresponding emphatic TD variants for off-policy evaluation and actor-critic algorithms for off-policy control. Finally, we empirically demonstrate the advantage of the proposed algorithms on a small domain as well as on challenging MuJoCo robot simulation tasks. Taken together, we hope that our work can provide new insight into the development of a better alternative in the family of off-policy emphatic algorithms.
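Since the abstract characterizes TDEC as a two-time-scale learner with a gradient correction and TDERC as TDEC with a regularized secondary update, the mechanism can be sketched by analogy with linear TDC and its regularized variant TDRC. The snippet below is a minimal illustrative sketch under those assumptions only, not the paper's algorithm: the class name, step sizes (alpha, beta), regularization strength (eta), and the reversed-recursion form of the emphasis TD error are hypothetical choices for exposition.

```python
import numpy as np

class TDERCEmphasis:
    """Hypothetical sketch of a two-time-scale emphasis learner with a
    TDC-style gradient correction and an L2-regularized secondary update,
    by analogy with TDC/TDRC. Not the paper's exact TDEC/TDERC updates."""

    def __init__(self, num_features, alpha=0.01, beta=0.05, eta=1.0, gamma=0.99):
        self.theta = np.zeros(num_features)  # primary weights: m_hat(s) = theta . phi(s)
        self.w = np.zeros(num_features)      # secondary (correction) weights
        self.alpha = alpha                   # primary (slow) step size
        self.beta = beta                     # secondary (fast) step size
        self.eta = eta                       # regularization; eta = 0 gives the TDEC-style update
        self.gamma = gamma                   # discount factor

    def update(self, phi_prev, rho_prev, interest, phi):
        """One transition (S_{t-1} -> S_t), with importance-sampling ratio
        rho_prev on the previous action and interest i(S_t)."""
        # Emphasis flows forward from predecessor states, mirroring the
        # followon recursion F_t = i(S_t) + gamma * rho_{t-1} * F_{t-1},
        # so the bootstrap target uses the *previous* state's estimate.
        delta = (interest
                 + self.gamma * rho_prev * (self.theta @ phi_prev)
                 - self.theta @ phi)
        # primary update with a TDC-style gradient-correction term
        self.theta += self.alpha * (
            delta * phi - self.gamma * rho_prev * (self.w @ phi) * phi_prev)
        # secondary update; the -eta * w term is the regularized correction
        self.w += self.beta * ((delta - self.w @ phi) * phi - self.eta * self.w)
        return self.theta @ phi  # current emphasis estimate m_hat(S_t)
```

Under this reading, the learned estimate m_hat(s) would then weight the value estimation gradient for off-policy evaluation, or the critic's contribution to the policy gradient for off-policy control, as the abstract describes.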
Pages: 20917-20937
Page count: 21
Related Papers
50 records in total
  • [41] Research on Off-Policy Evaluation in Reinforcement Learning: A Survey
    Wang S.-R., Niu W.-J., Tong E.-D., Chen T., Li H., Tian Y.-Z., Liu J.-Q., Han Z., Li Y.-D.
    Jisuanji Xuebao/Chinese Journal of Computers, 2022, 45(09): 1926-1945
  • [42] Off-Policy Reinforcement Learning for H∞ Control Design
    Luo, Biao; Wu, Huai-Ning; Huang, Tingwen
    IEEE TRANSACTIONS ON CYBERNETICS, 2015, 45(01): 65-76
  • [43] A Temporal-Difference Approach to Policy Gradient Estimation
    Tosatto, Samuele; Patterson, Andrew; White, Martha; Mahmood, A. Rupam
    INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 162, 2022
  • [44] Data-Efficient Off-Policy Policy Evaluation for Reinforcement Learning
    Thomas, Philip S.; Brunskill, Emma
    INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 48, 2016
  • [45] Universal Off-Policy Evaluation
    Chandak, Yash; Niekum, Scott; da Silva, Bruno Castro; Learned-Miller, Erik; Brunskill, Emma; Thomas, Philip S.
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 34 (NEURIPS 2021), 2021
  • [46] Fast Link Scheduling in Wireless Networks Using Regularized Off-Policy Reinforcement Learning
    Bhattacharya, Sagnik; Banerjee, Ayan; Peruru, Subrahmanya Swamy; Srinivas, Kothapalli Venkata
    IEEE Networking Letters, 2023, 5(02): 86-90
  • [47] Asymptotically Efficient Off-Policy Evaluation for Tabular Reinforcement Learning
    Yin, Ming; Wang, Yu-Xiang
    INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND STATISTICS, VOL 108, 2020
  • [48] Off-policy evaluation for tabular reinforcement learning with synthetic trajectories
    Wang, Weiwei; Li, Yuqiang; Wu, Xianyi
    Statistics and Computing, 2024, 34
  • [49] Doubly Robust Distributionally Robust Off-Policy Evaluation and Learning
    Kallus, Nathan; Mao, Xiaojie; Wang, Kaiwen; Zhou, Zhengyuan
    INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 162, 2022: 10598-10632
  • [50] Effective Off-Policy Evaluation and Learning in Contextual Combinatorial Bandits
    Shimizu, Tatsuhiro; Tanaka, Koichi; Kishimoto, Ren; Kiyohara, Haruka; Nomura, Masahiro; Saito, Yuta
    PROCEEDINGS OF THE EIGHTEENTH ACM CONFERENCE ON RECOMMENDER SYSTEMS, RECSYS 2024, 2024: 733-741