Policy Gradient for Continuing Tasks in Discounted Markov Decision Processes

Times Cited: 0
Authors
Paternain, Santiago [1]
Bazerque, Juan Andres [2,3]
Ribeiro, Alejandro [4]
Affiliations
[1] Rensselaer Polytech Inst, Dept Elect Comp & Syst Engn, Troy, NY 12180 USA
[2] Univ Republica, Dept Elect Engn, Montevideo, Uruguay
[3] Univ Pittsburgh, Pittsburgh, PA 15260 USA
[4] Univ Penn, Dept Elect & Syst Engn, Philadelphia, PA 19104 USA
Keywords
Task analysis; Trajectory; Convergence; Markov processes; Approximation algorithms; Transient analysis; Standards; Adaptive systems; gradient methods; reinforcement learning; stochastic systems;
DOI
10.1109/TAC.2022.3163085
Chinese Library Classification (CLC)
TP [Automation Technology, Computer Technology]
Discipline Classification Code
0812
Abstract
Reinforcement learning aims to find policies that maximize an expected cumulative reward in Markov decision processes with unknown transition probabilities. Policy gradient (PG) algorithms use stochastic gradients of the value function to update the policy. A major drawback of PG algorithms is that they are limited to episodic tasks (multiple finite-horizon trajectories) unless stringent stationarity assumptions are imposed on the trajectories. Hence, they require restarts and cannot be implemented fully online, which is critical for systems that need to adapt to new tasks and/or environments during deployment. Moreover, the standard stationary formulation ignores transient behavior. This motivates our study of infinite-horizon discounted MDPs without restarts. However, it is unknown whether, in this setting, following stochastic PG-type estimates improves the policy. The main result of this work establishes that when the policies belong to a reproducing kernel Hilbert space (RKHS) and the kernel is selected properly, these PG estimates are ascent directions for the value function conditioned on an arbitrary initial point. This allows us to prove convergence of our online algorithm to local optima. A numerical example shows that an agent running our online algorithm learns to navigate and succeeds in a surveillance task that requires looping between two goal locations. This example corroborates our theoretical findings on the ascent directions of successive stochastic gradients. It also shows how the online algorithm guides the agent along a continuing cyclic trajectory that does not comply with the standard stationarity assumptions made in the literature for non-episodic training.
Pages: 4467-4482
Page count: 16
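The abstract above describes an online, restart-free policy-gradient update for a continuing task with an RKHS-parameterized policy. Below is a minimal sketch of that general idea only, not the authors' algorithm: the one-dimensional two-goal environment, the RBF kernel and its bandwidth, the Gaussian exploration noise, the step size, and the truncated side rollout used as the return estimate are all illustrative assumptions.

```python
import numpy as np

# Hypothetical toy setup (NOT the paper's code): a point agent on the real line must
# alternate between two goal locations, a crude stand-in for the surveillance task in
# the paper's numerical example. The policy mean lives in an RKHS with a Gaussian (RBF)
# kernel, mu(x) = sum_i w_i * k(x, c_i), and actions are drawn as a ~ N(mu(x), sigma^2).
# The state x = (position, active goal index) so the policy can tell the goals apart.

rng = np.random.default_rng(0)
gamma, sigma, bandwidth, step_size = 0.9, 0.3, 0.5, 0.05
goals = np.array([-1.0, 1.0])
centers, weights = [], []               # kernel expansion representing the policy mean

def kernel(x, c):
    d = x - c
    return np.exp(-np.dot(d, d) / (2.0 * bandwidth ** 2))

def mean_action(x):
    return sum(w * kernel(x, c) for c, w in zip(centers, weights))

def env_step(pos, goal, a):
    """Move the agent; reaching the active goal pays 1 and switches the target."""
    pos = float(np.clip(pos + 0.2 * a, -2.0, 2.0))
    r = 1.0 if abs(pos - goals[goal]) < 0.1 else 0.0
    if r > 0.0:
        goal = 1 - goal
    return pos, goal, r

def return_estimate(pos, goal, horizon=30):
    """Truncated Monte Carlo estimate of the discounted return from (pos, goal).
    Simplification: a side rollout is simulated here; the paper's fully online scheme
    obtains its estimate from the single continuing trajectory itself."""
    g = 0.0
    for t in range(horizon):
        a = rng.normal(mean_action(np.array([pos, float(goal)])), sigma)
        pos, goal, r = env_step(pos, goal, a)
        g += (gamma ** t) * r
    return g

# One single continuing trajectory: the state is never reset.
pos, goal = 0.0, 0
for t in range(2000):
    x = np.array([pos, float(goal)])
    mu = mean_action(x)
    a = rng.normal(mu, sigma)                      # exploratory action
    q_hat = return_estimate(pos, goal)             # stochastic PG-type return estimate
    # Gaussian-policy score: d/dmu log N(a; mu, sigma^2) = (a - mu) / sigma^2.
    # In the RKHS, ascending along it adds one kernel center at the visited state.
    centers.append(x)
    weights.append(step_size * q_hat * (a - mu) / sigma ** 2)
    pos, goal, _ = env_step(pos, goal, a)

print(f"final position {pos:+.2f}, kernel expansion size {len(centers)}")
```

Note that the kernel expansion grows by one center per update, so a practical implementation would sparsify or prune it; the sketch keeps the full expansion for clarity.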