Trajectory-Aware Eligibility Traces for Off-Policy Reinforcement Learning

Cited by: 0
Authors
Daley, Brett [1 ,2 ]
White, Martha [1 ,2 ,3 ]
Amato, Christopher [4 ]
Machado, Marlos C. [1 ,2 ,3 ]
Affiliations
[1] Univ Alberta, Dept Comp Sci, Edmonton, AB, Canada
[2] Alberta Machine Intelligence Inst, Edmonton, AB, Canada
[3] Canada CIFAR AI Chair, Toronto, ON, Canada
[4] Northeastern Univ, Khoury Coll Comp Sci, Boston, MA, USA
Keywords
DOI
Not available
CLC Classification Number
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104; 0812; 0835; 1405;
Abstract
Off-policy learning from multistep returns is crucial for sample-efficient reinforcement learning, but counteracting off-policy bias without exacerbating variance is challenging. Classically, off-policy bias is corrected in a per-decision manner: past temporal-difference errors are re-weighted by the instantaneous Importance Sampling (IS) ratio after each action via eligibility traces. Many off-policy algorithms rely on this mechanism, along with differing protocols for cutting the IS ratios to combat the variance of the IS estimator. Unfortunately, once a trace has been fully cut, the effect cannot be reversed. This has led to the development of credit-assignment strategies that account for multiple past experiences at a time. These trajectory-aware methods have not been extensively analyzed, and their theoretical justification remains uncertain. In this paper, we propose a multistep operator that can express both per-decision and trajectory-aware methods. We prove convergence conditions for our operator in the tabular setting, establishing the first guarantees for several existing methods as well as many new ones. Finally, we introduce Recency-Bounded Importance Sampling (RBIS), which leverages trajectory awareness to perform robustly across lambda-values in several off-policy control tasks.
Pages: 18
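To make the abstract's description of per-decision importance-sampling corrections concrete, the sketch below shows a tabular off-policy value update in which every existing eligibility trace is decayed by a truncated IS ratio after each action (a Retrace-style cut, c_t = lambda * min(1, rho_t)). This is a minimal illustration under those assumptions, not the paper's multistep operator or its RBIS algorithm; the function name per_decision_trace_update and the trajectory fields are hypothetical.

```python
import numpy as np

def per_decision_trace_update(V, trajectory, alpha=0.1, gamma=0.99, lam=0.9):
    """Tabular off-policy TD(lambda)-style update with per-decision IS corrections.

    Sketch only: uses a Retrace-style truncated ratio c_t = lam * min(1, rho_t)
    to decay all existing traces after each action. `V` is a float array of
    state values; `trajectory` holds tuples (s, a, r, s_next, pi_prob, mu_prob).
    Terminal-state handling is omitted for brevity.
    """
    e = np.zeros(len(V), dtype=float)          # one eligibility trace per state
    for s, a, r, s_next, pi_prob, mu_prob in trajectory:
        rho = pi_prob / mu_prob                # instantaneous IS ratio pi/mu
        c = lam * min(1.0, rho)                # truncated ("cut") trace coefficient
        delta = r + gamma * V[s_next] - V[s]   # one-step TD error
        e *= gamma * c                         # decay and cut all existing traces
        e[s] += 1.0                            # accumulate trace for current state
        V += alpha * delta * e                 # assign credit to past states
    return V

# Example: 3-state chain where the behavior policy mu differs from the target pi.
V = np.zeros(3)
traj = [(0, 0, 0.0, 1, 0.9, 0.5), (1, 1, 1.0, 2, 0.1, 0.5)]
V = per_decision_trace_update(V, traj)
```

Because the clipped ratio multiplies every existing trace, a single strongly off-policy action can shrink all traces toward zero with no way to restore them later; this irreversibility is what motivates the trajectory-aware methods and RBIS discussed in the abstract.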