Adaptive Temporal-Difference Learning for Policy Evaluation with Per-State Uncertainty Estimates

Cited by: 0
Authors
Penedones, Hugo [1 ]
Riquelme, Carlos [2 ]
Vincent, Damien [2 ]
Maennel, Hartmut [2 ]
Mann, Timothy [1 ]
Barreto, Andre [1 ]
Gelly, Sylvain [2 ]
Neu, Gergely [3 ]
Affiliations
[1] DeepMind, London, England
[2] Google Brain, Mountain View, CA 94043 USA
[3] Univ Pompeu Fabra, Barcelona, Spain
Keywords: (none listed)
DOI: none available
Chinese Library Classification: TP18 [Artificial Intelligence Theory]
Subject classification codes: 081104; 0812; 0835; 1405
Abstract
We consider the core reinforcement-learning problem of on-policy value function approximation from a batch of trajectory data, and focus on various issues of Temporal Difference (TD) learning and Monte Carlo (MC) policy evaluation. The two methods are known to achieve complementary bias-variance trade-off properties, with TD tending to achieve lower variance but potentially higher bias. In this paper, we argue that the larger bias of TD can be a result of the amplification of local approximation errors. We address this by proposing an algorithm that adaptively switches between TD and MC in each state, thus mitigating the propagation of errors. Our method is based on learned confidence intervals that detect biases of TD estimates. We demonstrate in a variety of policy evaluation tasks that this simple adaptive algorithm performs competitively with the best approach in hindsight, suggesting that learned confidence intervals are a powerful technique for adapting policy evaluation to use TD or MC returns in a data-driven way.
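The switching rule the abstract describes lends itself to a short illustration. The following is a minimal, hypothetical Python sketch of the idea as stated: build per-state confidence intervals from Monte Carlo returns, accept a TD(0) target when it falls inside the interval, and fall back to the MC estimate otherwise. Everything here (the toy chain environment, gamma, the 95% interval multiplier z, all function names) is an assumption for illustration, not the authors' implementation.

import math
import random
from collections import defaultdict

gamma = 0.9   # discount factor (assumed for this toy problem)
z = 1.96      # ~95% normal confidence-interval half-width multiplier

def collect_trajectories(n_episodes, chain_len=5):
    """Toy on-policy evaluation problem: a deterministic chain with a
    noisy terminal reward. Returns a list of [(state, reward), ...]."""
    episodes = []
    for _ in range(n_episodes):
        ep = []
        for s in range(chain_len):
            r = random.gauss(1.0, 0.5) if s == chain_len - 1 else 0.0
            ep.append((s, r))
        episodes.append(ep)
    return episodes

def mc_statistics(episodes):
    """Per-state mean, variance, and count of Monte Carlo returns."""
    returns = defaultdict(list)
    for ep in episodes:
        G = 0.0
        for s, r in reversed(ep):
            G = r + gamma * G      # discounted return from state s
            returns[s].append(G)
    stats = {}
    for s, gs in returns.items():
        mean = sum(gs) / len(gs)
        var = sum((g - mean) ** 2 for g in gs) / max(len(gs) - 1, 1)
        stats[s] = (mean, var, len(gs))
    return stats

def adaptive_targets(episodes, V, stats):
    """Use the TD(0) target when it lies inside the per-state MC
    confidence interval; otherwise fall back to the MC mean."""
    targets = defaultdict(list)
    for ep in episodes:
        for t, (s, r) in enumerate(ep):
            v_next = V[ep[t + 1][0]] if t + 1 < len(ep) else 0.0
            td_target = r + gamma * v_next
            mean, var, n = stats[s]
            half_width = z * math.sqrt(var / n)
            # Trust TD (lower variance) unless it looks biased, i.e. it
            # falls outside the interval around the unbiased MC estimate.
            if abs(td_target - mean) <= half_width:
                targets[s].append(td_target)
            else:
                targets[s].append(mean)
    return targets

if __name__ == "__main__":
    episodes = collect_trajectories(200)
    stats = mc_statistics(episodes)
    V = defaultdict(float)
    for _ in range(50):   # repeated fitted sweeps over the batch
        targets = adaptive_targets(episodes, V, stats)
        V = defaultdict(float,
                        {s: sum(ts) / len(ts) for s, ts in targets.items()})
    print({s: round(v, 3) for s, v in sorted(V.items())})

In this sketch the fallback to MC is what limits error propagation: early in training, when V is inaccurate, TD targets tend to fall outside the MC intervals and are rejected, so local approximation errors are not bootstrapped into upstream states.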
Pages: 11
Related papers (50 items in total)
  • [41] Using temporal-difference learning for multi-agent bargaining
    Huang, Shiu-li
    Lin, Fu-ren
    ELECTRONIC COMMERCE RESEARCH AND APPLICATIONS, 2008, 7 (04) : 432 - 442
  • [42] Temporal-Difference Learning: An Online Support Vector Regression Approach
    Teixeira, Hugo Tanzarella
    Bottura, Celso Pascoli
    ICIMCO 2015 PROCEEDINGS OF THE 12TH INTERNATIONAL CONFERENCE ON INFORMATICS IN CONTROL, AUTOMATION AND ROBOTICS, VOL. 1, 2015, : 318 - 323
  • [43] Temporal-Difference Q-learning in Active Fault Diagnosis
    Skach, Jan
    Puncochar, Ivo
    Lewis, Frank L.
    2016 3RD CONFERENCE ON CONTROL AND FAULT-TOLERANT SYSTEMS (SYSTOL), 2016, : 287 - 292
  • [44] Correlation minimizing replay memory in temporal-difference reinforcement learning
    Ramicic, Mirza
    Bonarini, Andrea
    NEUROCOMPUTING, 2020, 393 : 91 - 100
  • [45] Implementing temporal-difference learning with the scaled conjugate gradient algorithm
    Falas, T
    Stafylopatis, A
    NEURAL PROCESSING LETTERS, 2005, 22 (03) : 361 - 375
  • [47] Fuzzy interpretation for temporal-difference learning in anomaly detection problems
    Sukhanov, A. V.
    Kovalev, S. M.
    Styskala, V.
    BULLETIN OF THE POLISH ACADEMY OF SCIENCES-TECHNICAL SCIENCES, 2016, 64 (03) : 625 - 632
  • [48] Online Multi-Task Gradient Temporal-Difference Learning
    Sreenivasan, Vishnu Purushothaman
    Ammar, Haitham Bou
    Eaton, Eric
    PROCEEDINGS OF THE TWENTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2014, : 3136 - 3137
  • [50] Distributed Consensus-Based Multi-Agent Off-Policy Temporal-Difference Learning
    Stankovic, Milos S.
    Beko, Marko
    Stankovic, Srdjan S.
    2021 60TH IEEE CONFERENCE ON DECISION AND CONTROL (CDC), 2021, : 5976 - 5981