Adaptive Temporal-Difference Learning for Policy Evaluation with Per-State Uncertainty Estimates

Cited by: 0
Authors:
Penedones, Hugo [1 ]
Riquelme, Carlos [2 ]
Vincent, Damien [2 ]
Maennel, Hartmut [2 ]
Mann, Timothy [1 ]
Barreto, Andre [1 ]
Gelly, Sylvain [2 ]
Neu, Gergely [3 ]
Affiliations:
[1] DeepMind, London, England
[2] Google Brain, Mountain View, CA 94043 USA
[3] Univ Pompeu Fabra, Barcelona, Spain
Keywords: (none listed)
DOI: not available
CLC Classification: TP18 [Artificial Intelligence Theory]
Subject Classification Codes: 081104; 0812; 0835; 1405
Abstract:
We consider the core reinforcement-learning problem of on-policy value function approximation from a batch of trajectory data, and focus on various issues of Temporal Difference (TD) learning and Monte Carlo (MC) policy evaluation. The two methods are known to achieve complementary bias-variance trade-off properties, with TD tending to achieve lower variance but potentially higher bias. In this paper, we argue that the larger bias of TD can be a result of the amplification of local approximation errors. We address this by proposing an algorithm that adaptively switches between TD and MC in each state, thus mitigating the propagation of errors. Our method is based on learned confidence intervals that detect biases of TD estimates. We demonstrate in a variety of policy evaluation tasks that this simple adaptive algorithm performs competitively with the best approach in hindsight, suggesting that learned confidence intervals are a powerful technique for adapting policy evaluation to use TD or MC returns in a data-driven way.
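The following is a minimal tabular sketch of the core idea described in the abstract, written as an illustration rather than the authors' implementation: it builds a per-state confidence interval from Monte Carlo returns and accepts the TD target for a state only when that target lies inside the interval, falling back to the unbiased MC return otherwise. The function name adaptive_td_update, the Gaussian-style interval (sample mean plus or minus z standard errors), and the tabular value dictionary V are all assumptions of this sketch.

import numpy as np

def adaptive_td_update(V, trajectories, gamma=0.99, alpha=0.1, z=1.96):
    # trajectories: list of episodes, each a list of (state, reward,
    # next_state) triples, with next_state = None at termination.

    # Pass 1: collect Monte Carlo returns per state.
    returns = {}
    for traj in trajectories:
        G = 0.0
        for s, r, _ in reversed(traj):
            G = r + gamma * G
            returns.setdefault(s, []).append(G)

    # Per-state interval: sample mean +/- z * standard error.
    # A state visited only once gets an infinite interval, so TD is
    # trusted by default there (an assumption of this sketch).
    ci = {}
    for s, gs in returns.items():
        m = float(np.mean(gs))
        se = float(np.std(gs)) / np.sqrt(len(gs)) if len(gs) > 1 else float("inf")
        ci[s] = (m - z * se, m + z * se)

    # Pass 2: per-state choice of target. Use the TD target when it
    # falls inside the MC confidence interval (its bias looks small
    # there); otherwise use the MC return, which stops local
    # approximation errors from being propagated by bootstrapping.
    for traj in trajectories:
        G = 0.0
        steps = []
        for s, r, s_next in reversed(traj):
            G = r + gamma * G
            steps.append((s, r, s_next, G))
        for s, r, s_next, G in reversed(steps):
            bootstrap = gamma * V.get(s_next, 0.0) if s_next is not None else 0.0
            td_target = r + bootstrap
            lo, hi = ci[s]
            target = td_target if lo <= td_target <= hi else G
            V[s] = V.get(s, 0.0) + alpha * (target - V.get(s, 0.0))
    return V

A call such as adaptive_td_update({}, batch_of_trajectories) would run one pass over a batch; repeating such passes lets the per-state switch adapt as the value estimates change, which is the data-driven TD/MC choice the abstract describes.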
Pages: 11