Adaptive Temporal-Difference Learning for Policy Evaluation with Per-State Uncertainty Estimates

Cited by: 0
Authors:
Penedones, Hugo [1 ]
Riquelme, Carlos [2 ]
Vincent, Damien [2 ]
Maennel, Hartmut [2 ]
Mann, Timothy [1 ]
Barreto, Andre [1 ]
Gelly, Sylvain [2 ]
Neu, Gergely [3 ]
Affiliations:
[1] DeepMind, London, England
[2] Google Brain, Mountain View, CA 94043 USA
[3] Univ Pompeu Fabra, Barcelona, Spain
Keywords:
DOI: Not available
Chinese Library Classification: TP18 [Artificial Intelligence Theory]
Subject classification codes: 081104; 0812; 0835; 1405
Abstract
We consider the core reinforcement-learning problem of on-policy value function approximation from a batch of trajectory data, and focus on various issues of Temporal Difference (TD) learning and Monte Carlo (MC) policy evaluation. The two methods are known to achieve complementary bias-variance trade-off properties, with TD tending to achieve lower variance but potentially higher bias. In this paper, we argue that the larger bias of TD can be a result of the amplification of local approximation errors. We address this by proposing an algorithm that adaptively switches between TD and MC in each state, thus mitigating the propagation of errors. Our method is based on learned confidence intervals that detect biases of TD estimates. We demonstrate in a variety of policy evaluation tasks that this simple adaptive algorithm performs competitively with the best approach in hindsight, suggesting that learned confidence intervals are a powerful technique for adapting policy evaluation to use TD or MC returns in a data-driven way.
Pages: 11
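
The abstract describes an algorithm that switches per state between TD and MC targets, using learned confidence intervals to detect when a TD estimate is likely biased. As an illustration only, the minimal sketch below shows one plausible reading of that rule: keep the one-step TD target when it falls inside a confidence interval around the Monte Carlo return, otherwise fall back to the MC return. The function name `adaptive_td_targets`, the interval inputs `mc_ci_low`/`mc_ci_high` (e.g., obtained from an ensemble of MC regressors), and the exact switching test are assumptions made for this sketch, not the authors' published implementation.

```python
import numpy as np

def adaptive_td_targets(rewards, next_values, mc_returns,
                        mc_ci_low, mc_ci_high, gamma=0.99):
    """Build per-state regression targets for value-function fitting.

    Hypothetical sketch: the one-step TD target r + gamma * V(s') is kept
    when it is consistent with a confidence interval around the Monte Carlo
    return estimate for that state; otherwise the MC return is used, so a
    locally biased TD bootstrap is not propagated to other states.
    """
    rewards = np.asarray(rewards, dtype=float)
    next_values = np.asarray(next_values, dtype=float)
    mc_returns = np.asarray(mc_returns, dtype=float)
    mc_ci_low = np.asarray(mc_ci_low, dtype=float)
    mc_ci_high = np.asarray(mc_ci_high, dtype=float)

    td_targets = rewards + gamma * next_values            # bootstrapped targets
    trust_td = (mc_ci_low <= td_targets) & (td_targets <= mc_ci_high)
    return np.where(trust_td, td_targets, mc_returns)     # per-state TD/MC choice
```

The interval test is the data-driven switch mentioned in the abstract: when the bootstrapped target agrees with the (higher-variance but unbiased) MC evidence, the lower-variance TD target is preferred; when it does not, the MC return is used instead.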