Policy mirror descent for reinforcement learning: linear convergence, new sampling complexity, and generalized problem classes

Cited by: 17
Authors:
Lan, Guanghui [1]
Affiliation:
[1] Georgia Inst Technol, H Milton Stewart Sch Ind & Syst Engn, Atlanta, GA 30332 USA
Funding:
U.S. National Science Foundation
Keywords (MSC codes):
60J10; 90C15; 90C30; 90C40
DOI:
10.1007/s10107-022-01816-5
Chinese Library Classification (CLC):
TP31 [Computer Software]
Discipline codes:
081202; 0835
Abstract
We present new policy mirror descent (PMD) methods for solving reinforcement learning (RL) problems with either strongly convex or general convex regularizers. By exploring the structural properties of these overall highly nonconvex problems, we show that the PMD methods exhibit a fast linear rate of convergence to global optimality. We develop stochastic counterparts of these methods and establish an O(1/ε) (resp., O(1/ε²)) sampling complexity for solving these RL problems with strongly (resp., general) convex regularizers using different sampling schemes, where ε denotes the target accuracy. We further show that the complexity for computing the gradients of these regularizers, if necessary, can be bounded by O{log_γ(ε) [(1 − γ)L/μ]^{1/2} log(1/ε)} (resp., O{log_γ(ε) (L/ε)^{1/2}}) for problems with strongly (resp., general) convex regularizers, where γ denotes the discount factor. To the best of our knowledge, these complexity bounds, together with our algorithmic developments, appear to be new in both the optimization and RL literature. The introduction of these convex regularizers also greatly enhances the flexibility, and thus expands the applicability, of RL models.
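To make the PMD idea concrete, the following is a minimal sketch of a PMD iteration with the KL (entropy) Bregman divergence on a small random tabular MDP. All problem sizes, the geometric stepsize schedule `eta = 1.5**k` (geometrically increasing stepsizes are one route to linear convergence in this line of work), and the iteration count are illustrative choices, not the paper's exact algorithm or parameter settings; no regularizer is included here.

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 4, 3, 0.9
P = rng.dirichlet(np.ones(nS), size=(nS, nA))  # transition kernel P[s, a, s']
r = rng.uniform(size=(nS, nA))                 # rewards in [0, 1]

def policy_eval(pi):
    """Exact policy evaluation: V = (I - gamma * P_pi)^{-1} r_pi, then Q."""
    P_pi = np.einsum('sa,sat->st', pi, P)
    r_pi = (pi * r).sum(axis=1)
    V = np.linalg.solve(np.eye(nS) - gamma * P_pi, r_pi)
    Q = r + gamma * np.einsum('sat,t->sa', P, V)
    return V, Q

# Optimal values via value iteration, used only to measure suboptimality.
V_star = np.zeros(nS)
for _ in range(3000):
    V_star = (r + gamma * P @ V_star).max(axis=1)

log_pi = np.full((nS, nA), -np.log(nA))  # uniform initial policy, in log space
gaps = []
for k in range(50):
    pi = np.exp(log_pi)
    V, Q = policy_eval(pi)
    gaps.append(np.max(V_star - V))
    # KL-divergence PMD step: pi_{k+1}(.|s) proportional to pi_k(.|s) * exp(eta_k * Q(s,.)),
    # done in log space for numerical stability.
    eta = 1.5 ** k  # geometrically increasing stepsizes (illustrative)
    log_pi = log_pi + eta * Q
    log_pi -= log_pi.max(axis=1, keepdims=True)
    log_pi -= np.log(np.exp(log_pi).sum(axis=1, keepdims=True))

print(f"suboptimality gap: {gaps[0]:.2e} -> {gaps[-1]:.2e}")
```

On this toy instance the suboptimality gap max_s (V*(s) − V^{π_k}(s)) shrinks rapidly across iterations; with the KL divergence each PMD step reduces to a multiplicative (softmax-style) policy update, which is why it can be written in closed form above.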
Pages: 1059-1106 (48 pages)
Related Papers
10 records
  • [1] Policy mirror descent for reinforcement learning: linear convergence, new sampling complexity, and generalized problem classes
    Guanghui Lan
    [J]. Mathematical Programming, 2023, 198 : 1059 - 1106
  • [2] POLICY MIRROR DESCENT FOR REGULARIZED REINFORCEMENT LEARNING: A GENERALIZED FRAMEWORK WITH LINEAR CONVERGENCE
    Zhan, Wenhao
    Cen, Shicong
    Huang, Baihe
    Chen, Yuxin
    Lee, Jason D.
    Chi, Yuejie
    [J]. SIAM JOURNAL ON OPTIMIZATION, 2023, 33 (02) : 1061 - 1091
  • [3] Homotopic policy mirror descent: policy convergence, algorithmic regularization, and improved sample complexity
    Li, Yan
    Lan, Guanghui
    Zhao, Tuo
    [J]. MATHEMATICAL PROGRAMMING, 2024, 207 (1-2) : 457 - 513
  • [4] Convergence and Iteration Complexity of Policy Gradient Method for Infinite-horizon Reinforcement Learning
    Zhang, Kaiqing
    Koppel, Alec
    Zhu, Hao
    Basar, Tamer
    [J]. 2019 IEEE 58TH CONFERENCE ON DECISION AND CONTROL (CDC), 2019, : 7415 - 7422
  • [5] Off-policy learning based on weighted importance sampling with linear computational complexity
    Mahmood, A. Rupam
    Sutton, Richard S.
    [J]. UNCERTAINTY IN ARTIFICIAL INTELLIGENCE, 2015, : 552 - 561
  • [6] Reinforcement Learning in Linear Quadratic Deep Structured Teams: Global Convergence of Policy Gradient Methods
    Fathi, Vida
    Arabneydi, Jalal
    Aghdam, Amir G.
    [J]. 2020 59TH IEEE CONFERENCE ON DECISION AND CONTROL (CDC), 2020, : 4927 - 4932
  • [7] Reinforcement Learning in Nonzero-sum Linear Quadratic Deep Structured Games: Global Convergence of Policy Optimization
    Roudneshin, Masoud
    Arabneydi, Jalal
    Aghdam, Amir G.
    [J]. 2020 59TH IEEE CONFERENCE ON DECISION AND CONTROL (CDC), 2020, : 512 - 517
  • [8] Reinforcement Learning of Control Policy for Linear Temporal Logic Specifications Using Limit-Deterministic Generalized Büchi Automata
    Oura, Ryohei
    Sakakibara, Ami
    Ushio, Toshimitsu
    [J]. IEEE CONTROL SYSTEMS LETTERS, 2020, 4 (03): : 761 - 766
  • [9] Challenging the Limits of Binarization: A New Scheme Selection Policy Using Reinforcement Learning Techniques for Binary Combinatorial Problem Solving
    Becerra-Rozas, Marcelo
    Crawford, Broderick
    Soto, Ricardo
    Talbi, El-Ghazali
    Gomez-Pulido, Jose M.
    [J]. BIOMIMETICS, 2024, 9 (02)
  • [10] Policy Gradient Reinforcement Learning Method for Discrete-Time Linear Quadratic Regulation Problem Using Estimated State Value Function
    Sasaki, Tomotake
    Uchibe, Eiji
    Iwane, Hidenao
    Yanami, Hitoshi
    Anai, Hirokazu
    Doya, Kenji
    [J]. 2017 56TH ANNUAL CONFERENCE OF THE SOCIETY OF INSTRUMENT AND CONTROL ENGINEERS OF JAPAN (SICE), 2017, : 653 - 657