Trajectory-Aware Eligibility Traces for Off-Policy Reinforcement Learning

Cited by: 0
Authors
Daley, Brett [1 ,2 ]
White, Martha [1 ,2 ,3 ]
Amato, Christopher [4 ]
Machado, Marlos C. [1 ,2 ,3 ]
Affiliations
[1] Univ Alberta, Dept Comp Sci, Edmonton, AB, Canada
[2] Alberta Machine Intelligence Inst, Edmonton, AB, Canada
[3] Canada CIFAR AI Chair, Toronto, ON, Canada
[4] Northeastern Univ, Khoury Coll Comp Sci, Boston, MA, USA
Keywords
DOI
Not available
CLC Classification Number
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104; 0812; 0835; 1405;
Abstract
Off-policy learning from multistep returns is crucial for sample-efficient reinforcement learning, but counteracting off-policy bias without exacerbating variance is challenging. Classically, off-policy bias is corrected in a per-decision manner: past temporal-difference errors are re-weighted by the instantaneous Importance Sampling (IS) ratio after each action via eligibility traces. Many off-policy algorithms rely on this mechanism, along with differing protocols for cutting the IS ratios to combat the variance of the IS estimator. Unfortunately, once a trace has been fully cut, the effect cannot be reversed. This has led to the development of credit-assignment strategies that account for multiple past experiences at a time. These trajectory-aware methods have not been extensively analyzed, and their theoretical justification remains uncertain. In this paper, we propose a multistep operator that can express both per-decision and trajectory-aware methods. We prove convergence conditions for our operator in the tabular setting, establishing the first guarantees for several existing methods as well as many new ones. Finally, we introduce Recency-Bounded Importance Sampling (RBIS), which leverages trajectory awareness to perform robustly across lambda-values in several off-policy control tasks.
Pages: 18
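To make the abstract's description of per-decision importance-sampling corrections concrete, the sketch below shows a tabular off-policy value update in which every existing eligibility trace is decayed by a truncated IS ratio after each action (a Retrace-style cut, c_t = lambda * min(1, rho_t)). This is a minimal illustration under those assumptions, not the paper's multistep operator or its RBIS algorithm; the function name per_decision_trace_update and the trajectory fields are hypothetical.

```python
import numpy as np

def per_decision_trace_update(V, trajectory, alpha=0.1, gamma=0.99, lam=0.9):
    """Tabular off-policy TD(lambda)-style update with per-decision IS corrections.

    Sketch only: uses a Retrace-style truncated ratio c_t = lam * min(1, rho_t)
    to decay all existing traces after each action. `V` is a float array of
    state values; `trajectory` holds tuples (s, a, r, s_next, pi_prob, mu_prob).
    Terminal-state handling is omitted for brevity.
    """
    e = np.zeros(len(V), dtype=float)          # one eligibility trace per state
    for s, a, r, s_next, pi_prob, mu_prob in trajectory:
        rho = pi_prob / mu_prob                # instantaneous IS ratio pi/mu
        c = lam * min(1.0, rho)                # truncated ("cut") trace coefficient
        delta = r + gamma * V[s_next] - V[s]   # one-step TD error
        e *= gamma * c                         # decay and cut all existing traces
        e[s] += 1.0                            # accumulate trace for current state
        V += alpha * delta * e                 # assign credit to past states
    return V

# Example: 3-state chain where the behavior policy mu differs from the target pi.
V = np.zeros(3)
traj = [(0, 0, 0.0, 1, 0.9, 0.5), (1, 1, 1.0, 2, 0.1, 0.5)]
V = per_decision_trace_update(V, traj)
```

Because the clipped ratio multiplies every existing trace, a single strongly off-policy action can shrink all traces toward zero with no way to restore them later; this irreversibility is what motivates the trajectory-aware methods and RBIS discussed in the abstract.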