Adaptive Temporal-Difference Learning for Policy Evaluation with Per-State Uncertainty Estimates

Cited by: 0
Authors:
Penedones, Hugo [1 ]
Riquelme, Carlos [2 ]
Vincent, Damien [2 ]
Maennel, Hartmut [2 ]
Mann, Timothy [1 ]
Barreto, Andre [1 ]
Gelly, Sylvain [2 ]
Neu, Gergely [3 ]
Affiliations:
[1] DeepMind, London, England
[2] Google Brain, Mountain View, CA 94043 USA
[3] Univ Pompeu Fabra, Barcelona, Spain
Keywords:
DOI: Not available
Chinese Library Classification: TP18 [Artificial Intelligence Theory]
Subject classification codes: 081104; 0812; 0835; 1405
Abstract
We consider the core reinforcement-learning problem of on-policy value function approximation from a batch of trajectory data, and focus on various issues of Temporal Difference (TD) learning and Monte Carlo (MC) policy evaluation. The two methods are known to achieve complementary bias-variance trade-off properties, with TD tending to achieve lower variance but potentially higher bias. In this paper, we argue that the larger bias of TD can be a result of the amplification of local approximation errors. We address this by proposing an algorithm that adaptively switches between TD and MC in each state, thus mitigating the propagation of errors. Our method is based on learned confidence intervals that detect biases of TD estimates. We demonstrate in a variety of policy evaluation tasks that this simple adaptive algorithm performs competitively with the best approach in hindsight, suggesting that learned confidence intervals are a powerful technique for adapting policy evaluation to use TD or MC returns in a data-driven way.
Pages: 11
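
The abstract describes an algorithm that switches per state between TD and MC targets, using learned confidence intervals to detect when a TD estimate is likely biased. As an illustration only, the minimal sketch below shows one plausible reading of that rule: keep the one-step TD target when it falls inside a confidence interval around the Monte Carlo return, otherwise fall back to the MC return. The function name `adaptive_td_targets`, the interval inputs `mc_ci_low`/`mc_ci_high` (e.g., obtained from an ensemble of MC regressors), and the exact switching test are assumptions made for this sketch, not the authors' published implementation.

```python
import numpy as np

def adaptive_td_targets(rewards, next_values, mc_returns,
                        mc_ci_low, mc_ci_high, gamma=0.99):
    """Build per-state regression targets for value-function fitting.

    Hypothetical sketch: the one-step TD target r + gamma * V(s') is kept
    when it is consistent with a confidence interval around the Monte Carlo
    return estimate for that state; otherwise the MC return is used, so a
    locally biased TD bootstrap is not propagated to other states.
    """
    rewards = np.asarray(rewards, dtype=float)
    next_values = np.asarray(next_values, dtype=float)
    mc_returns = np.asarray(mc_returns, dtype=float)
    mc_ci_low = np.asarray(mc_ci_low, dtype=float)
    mc_ci_high = np.asarray(mc_ci_high, dtype=float)

    td_targets = rewards + gamma * next_values            # bootstrapped targets
    trust_td = (mc_ci_low <= td_targets) & (td_targets <= mc_ci_high)
    return np.where(trust_td, td_targets, mc_returns)     # per-state TD/MC choice
```

The interval test is the data-driven switch mentioned in the abstract: when the bootstrapped target agrees with the (higher-variance but unbiased) MC evidence, the lower-variance TD target is preferred; when it does not, the MC return is used instead.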