Approximate policy iteration: A survey and some new methods

Cited by: 148
Authors
Bertsekas, D. P. [1]
Affiliations
[1] Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge
Source
Control Theory and Technology, 2011, 9(3)
Funding
U.S. National Science Foundation
Keywords
Aggregation; Chattering; Dynamic programming; Policy iteration; Projected equation; Regularization;
DOI
10.1007/s11768-011-1005-3
Abstract
We consider the classical policy iteration method of dynamic programming (DP), where approximations and simulation are used to deal with the curse of dimensionality. We survey a number of issues: convergence and rate of convergence of approximate policy evaluation methods, singularity and susceptibility to simulation noise of policy evaluation, exploration issues, constrained and enhanced policy iteration, policy oscillation and chattering, and optimistic and distributed policy iteration. Our discussion of policy evaluation is couched in general terms and aims to unify the available methods in the light of recent research developments and to compare the two main policy evaluation approaches: projected equations and temporal differences (TD), and aggregation. In the context of these approaches, we survey two different types of simulation-based algorithms: matrix inversion methods, such as least-squares temporal difference (LSTD), and iterative methods, such as least-squares policy evaluation (LSPE) and TD(λ), and their scaled variants. We discuss a recent method, based on regression and regularization, which rectifies the unreliability of LSTD for nearly singular projected Bellman equations. An iterative version of this method belongs to the LSPE class of methods and provides the connecting link between LSTD and LSPE. Our discussion of policy improvement focuses on the role of policy oscillation and its effect on performance guarantees. We illustrate that policy evaluation, when done by the projected equation/TD approach, may lead to policy oscillation, but when done by aggregation it does not. This implies better error bounds and more regular performance for aggregation, at the expense of some loss of generality in cost function representation capability. Hard aggregation provides the connecting link between projected equation/TD-based and aggregation-based policy evaluation, and is characterized by favorable error bounds. © 2011 South China University of Technology, Academy of Mathematics and Systems Science, Chinese Academy of Sciences and Springer-Verlag Berlin Heidelberg.
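To make the two simulation-based families mentioned in the abstract concrete, the sketch below contrasts a matrix-inversion (LSTD-style) solve of a sampled projected Bellman equation with an iterative (LSPE-style) fixed-point update, for λ = 0. The Markov reward process, feature matrix, trajectory length, and the small ridge term guarding against a nearly singular system are all invented for illustration; the ridge is only a crude stand-in for the regression/regularization-based method discussed in the paper.

```python
# Minimal sketch: LSTD vs. an LSPE-style iteration on a synthetic example.
# Everything here (P, r, PHI, gamma, T, beta) is hypothetical, not from the paper.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_features, gamma = 20, 4, 0.95

P = rng.random((n_states, n_states))
P /= P.sum(axis=1, keepdims=True)            # row-stochastic transitions under a fixed policy
r = rng.random(n_states)                     # expected one-stage rewards
PHI = rng.random((n_states, n_features))     # feature rows phi(s)

# Accumulate sample estimates of the projected Bellman equation C w = d.
T = 50_000
A = np.zeros((n_features, n_features))       # estimate of C = Phi' Xi (I - gamma P) Phi
b = np.zeros(n_features)                     # estimate of d = Phi' Xi r
M = np.zeros((n_features, n_features))       # estimate of Phi' Xi Phi (used below)
s = 0
for _ in range(T):
    s_next = rng.choice(n_states, p=P[s])
    phi, phi_next = PHI[s], PHI[s_next]
    A += np.outer(phi, phi - gamma * phi_next)
    b += phi * r[s]
    M += np.outer(phi, phi)
    s = s_next
A, b, M = A / T, b / T, M / T

# Matrix-inversion (LSTD) solution; beta*I is a crude regularization against
# near-singularity of the sampled system.
beta = 1e-6
w_lstd = np.linalg.solve(A + beta * np.eye(n_features), b)

# Iterative (LSPE-flavored) alternative: projected value iteration on the same
# sample estimates, which converges toward the same solution.
w = np.zeros(n_features)
for _ in range(200):
    w = w - np.linalg.solve(M, A @ w - b)

print("LSTD weights:", w_lstd)
print("LSPE-style weights:", w)
```

When the sampled matrix A is nearly singular, the direct LSTD solve becomes unreliable without some form of regularization; this is the issue that the paper's regression-based method addresses in a more principled way than the simple ridge term used above.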
Pages: 310-335
Page count: 25
Related papers
50 records in total
  • [1] Approximate policy iteration: a survey and some new methods
    Dimitri P. Bertsekas
    [J]. Control Theory and Technology, 2011, 9 (03) : 310 - 335
  • [2] Adaptive Approximate Policy Iteration
    Hao, Botao
    Lazic, Nevena
    Abbasi-Yadkori, Yasin
    Joulani, Pooria
    Szepesvari, Csaba
    [J]. 24TH INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND STATISTICS (AISTATS), 2021, 130 : 523 - 531
  • [3] Safe Policy Iteration: A Monotonically Improving Approximate Policy Iteration Approach
    Metelli, Alberto Maria
    Pirotta, Matteo
    Calandriello, Daniele
    Restelli, Marcello
    [J]. JOURNAL OF MACHINE LEARNING RESEARCH, 2021, 22
  • [4] Approximate policy iteration with a policy language bias
    Fern, A
    Yoon, S
    Givan, R
    [J]. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 16, 2004, 16 : 847 - 854
  • [5] SOME PROJECTION-ITERATION METHODS OF THE APPROXIMATE CONSTRUCTION OF IMPLICIT FUNCTIONS
    ZLEPKO, PP
    [J]. DOPOVIDI AKADEMII NAUK UKRAINSKOI RSR SERIYA A-FIZIKO-MATEMATICHNI TA TECHNICHNI NAUKI, 1979, (12): 979 - 981
  • [6] Projections for Approximate Policy Iteration Algorithms
    Akrour, Riad
    Pajarinen, Joni
    Peters, Jan
    Neumann, Gerhard
    [J]. INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 97, 2019, 97
  • [7] Rollout sampling approximate policy iteration
    Dimitrakakis, Christos
    Lagoudakis, Michail G.
    [J]. MACHINE LEARNING, 2008, 72 (03) : 157 - 171
  • [8] Approximate Policy Iteration Schemes: A Comparison
    Scherrer, Bruno
    [J]. INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 32 (CYCLE 2), 2014, 32 : 1314 - 1322