Policy mirror descent for reinforcement learning: linear convergence, new sampling complexity, and generalized problem classes

Cited by: 17
Authors:
Lan, Guanghui [1]
Affiliation:
[1] Georgia Inst Technol, H Milton Stewart Sch Ind & Syst Engn, Atlanta, GA 30332 USA
Funding:
U.S. National Science Foundation
Keywords (MSC codes):
60J10; 90C15; 90C30; 90C40
DOI:
10.1007/s10107-022-01816-5
Chinese Library Classification (CLC):
TP31 [Computer Software]
Discipline codes:
081202; 0835
Abstract
We present new policy mirror descent (PMD) methods for solving reinforcement learning (RL) problems with either strongly convex or general convex regularizers. By exploring the structural properties of these overall highly nonconvex problems, we show that the PMD methods exhibit a fast linear rate of convergence to global optimality. We develop stochastic counterparts of these methods and establish an O(1/ε) (resp., O(1/ε²)) sampling complexity for solving these RL problems with strongly (resp., general) convex regularizers using different sampling schemes, where ε denotes the target accuracy. We further show that the complexity for computing the gradients of these regularizers, if necessary, can be bounded by O{log_γ(ε) [(1 − γ)L/μ]^{1/2} log(1/ε)} (resp., O{log_γ(ε) (L/ε)^{1/2}}) for problems with strongly (resp., general) convex regularizers, where γ denotes the discount factor. To the best of our knowledge, these complexity bounds, together with our algorithmic developments, appear to be new in both the optimization and RL literature. The introduction of these convex regularizers also greatly enhances the flexibility, and thus expands the applicability, of RL models.
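To make the PMD idea concrete, the following is a minimal sketch of a PMD iteration with the KL (entropy) Bregman divergence on a small random tabular MDP. All problem sizes, the geometric stepsize schedule `eta = 1.5**k` (geometrically increasing stepsizes are one route to linear convergence in this line of work), and the iteration count are illustrative choices, not the paper's exact algorithm or parameter settings; no regularizer is included here.

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 4, 3, 0.9
P = rng.dirichlet(np.ones(nS), size=(nS, nA))  # transition kernel P[s, a, s']
r = rng.uniform(size=(nS, nA))                 # rewards in [0, 1]

def policy_eval(pi):
    """Exact policy evaluation: V = (I - gamma * P_pi)^{-1} r_pi, then Q."""
    P_pi = np.einsum('sa,sat->st', pi, P)
    r_pi = (pi * r).sum(axis=1)
    V = np.linalg.solve(np.eye(nS) - gamma * P_pi, r_pi)
    Q = r + gamma * np.einsum('sat,t->sa', P, V)
    return V, Q

# Optimal values via value iteration, used only to measure suboptimality.
V_star = np.zeros(nS)
for _ in range(3000):
    V_star = (r + gamma * P @ V_star).max(axis=1)

log_pi = np.full((nS, nA), -np.log(nA))  # uniform initial policy, in log space
gaps = []
for k in range(50):
    pi = np.exp(log_pi)
    V, Q = policy_eval(pi)
    gaps.append(np.max(V_star - V))
    # KL-divergence PMD step: pi_{k+1}(.|s) proportional to pi_k(.|s) * exp(eta_k * Q(s,.)),
    # done in log space for numerical stability.
    eta = 1.5 ** k  # geometrically increasing stepsizes (illustrative)
    log_pi = log_pi + eta * Q
    log_pi -= log_pi.max(axis=1, keepdims=True)
    log_pi -= np.log(np.exp(log_pi).sum(axis=1, keepdims=True))

print(f"suboptimality gap: {gaps[0]:.2e} -> {gaps[-1]:.2e}")
```

On this toy instance the suboptimality gap max_s (V*(s) − V^{π_k}(s)) shrinks rapidly across iterations; with the KL divergence each PMD step reduces to a multiplicative (softmax-style) policy update, which is why it can be written in closed form above.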
Pages: 1059-1106 (48 pages)
Related Papers
10 records
  • [1] Policy mirror descent for reinforcement learning: linear convergence, new sampling complexity, and generalized problem classes
    Guanghui Lan
    [J]. Mathematical Programming, 2023, 198 : 1059 - 1106
  • [2] POLICY MIRROR DESCENT FOR REGULARIZED REINFORCEMENT LEARNING: A GENERALIZED FRAMEWORK WITH LINEAR CONVERGENCE
    Zhan, Wenhao
    Cen, Shicong
    Huang, Baihe
    Chen, Yuxin
    Lee, Jason D.
    Chi, Yuejie
    [J]. SIAM JOURNAL ON OPTIMIZATION, 2023, 33 (02) : 1061 - 1091
  • [3] Homotopic policy mirror descent: policy convergence, algorithmic regularization, and improved sample complexity
    Li, Yan
    Lan, Guanghui
    Zhao, Tuo
    [J]. MATHEMATICAL PROGRAMMING, 2024, 207 (1-2) : 457 - 513
  • [4] Convergence and Iteration Complexity of Policy Gradient Method for Infinite-horizon Reinforcement Learning
    Zhang, Kaiqing
    Koppel, Alec
    Zhu, Hao
    Basar, Tamer
    [J]. 2019 IEEE 58TH CONFERENCE ON DECISION AND CONTROL (CDC), 2019, : 7415 - 7422
  • [5] Off-policy learning based on weighted importance sampling with linear computational complexity
    Mahmood, A. Rupam
    Sutton, Richard S.
    [J]. UNCERTAINTY IN ARTIFICIAL INTELLIGENCE, 2015, : 552 - 561
  • [6] Reinforcement Learning in Linear Quadratic Deep Structured Teams: Global Convergence of Policy Gradient Methods
    Fathi, Vida
    Arabneydi, Jalal
    Aghdam, Amir G.
    [J]. 2020 59TH IEEE CONFERENCE ON DECISION AND CONTROL (CDC), 2020, : 4927 - 4932
  • [7] Reinforcement Learning in Nonzero-sum Linear Quadratic Deep Structured Games: Global Convergence of Policy Optimization
    Roudneshin, Masoud
    Arabneydi, Jalal
    Aghdam, Amir G.
    [J]. 2020 59TH IEEE CONFERENCE ON DECISION AND CONTROL (CDC), 2020, : 512 - 517
  • [8] Reinforcement Learning of Control Policy for Linear Temporal Logic Specifications Using Limit-Deterministic Generalized Büchi Automata
    Oura, Ryohei
    Sakakibara, Ami
    Ushio, Toshimitsu
    [J]. IEEE CONTROL SYSTEMS LETTERS, 2020, 4 (03): : 761 - 766
  • [9] Challenging the Limits of Binarization: A New Scheme Selection Policy Using Reinforcement Learning Techniques for Binary Combinatorial Problem Solving
    Becerra-Rozas, Marcelo
    Crawford, Broderick
    Soto, Ricardo
    Talbi, El-Ghazali
    Gomez-Pulido, Jose M.
    [J]. BIOMIMETICS, 2024, 9 (02)
  • [10] Policy Gradient Reinforcement Learning Method for Discrete-Time Linear Quadratic Regulation Problem Using Estimated State Value Function
    Sasaki, Tomotake
    Uchibe, Eiji
    Iwane, Hidenao
    Yanami, Hitoshi
    Anai, Hirokazu
    Doya, Kenji
    [J]. 2017 56TH ANNUAL CONFERENCE OF THE SOCIETY OF INSTRUMENT AND CONTROL ENGINEERS OF JAPAN (SICE), 2017, : 653 - 657