Policy Gradient and Actor-Critic Learning in Continuous Time and Space: Theory and Algorithms

Cited by: 0
Authors
Jia, Yanwei [1 ]
Zhou, Xun Yu [1 ,2 ]
Affiliations
[1] Columbia Univ, Dept Ind Engn & Operat Res, New York, NY 10027 USA
[2] Columbia Univ, Data Sci Inst, New York, NY 10027 USA
Keywords
reinforcement learning; continuous time and space; policy gradient; policy evaluation; actor-critic algorithms; martingale; ergodic control; multidimensional diffusions; portfolio selection
DOI
Not available
Chinese Library Classification
TP [automation technology, computer technology]
Discipline code
0812
Abstract
We study policy gradient (PG) for reinforcement learning in continuous time and space under the regularized exploratory formulation developed by Wang et al. (2020). We represent the gradient of the value function with respect to a given parameterized stochastic policy as the expected integration of an auxiliary running reward function that can be evaluated using samples and the current value function. This representation effectively turns PG into a policy evaluation (PE) problem, enabling us to apply the martingale approach recently developed by Jia and Zhou (2022a) for PE to solve our PG problem. Based on this analysis, we propose two types of actor-critic algorithms for RL, where we learn and update value functions and policies simultaneously and alternatingly. The first type is based directly on the aforementioned representation, which involves future trajectories and is offline. The second type, designed for online learning, employs the first-order condition of the policy gradient and turns it into martingale orthogonality conditions. These conditions are then incorporated using stochastic approximation when updating policies. Finally, we demonstrate the algorithms by simulations in two concrete examples.
Pages: 50
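To make the online ("second type") scheme in the abstract concrete, here is a minimal Python sketch of one discretized actor-critic loop under stated assumptions: a toy scalar controlled diffusion, a quadratic critic, a Gaussian policy, and ad-hoc constants (temperature lam, discount beta, step sizes). None of these choices come from the paper; the sketch only illustrates how a common TD-style increment can drive both the critic update (the policy-evaluation, i.e. martingale, condition) and the actor update (the policy-gradient step).

    # Minimal, illustrative sketch (NOT the paper's exact algorithm) of an online
    # actor-critic update in discretized continuous time with entropy regularization.
    # All parameterizations, dynamics, and constants below are hypothetical.
    import numpy as np

    rng = np.random.default_rng(0)
    dt = 0.01                        # time-discretization step
    lam = 0.1                        # exploration temperature (entropy regularization)
    beta = 0.5                       # discount rate of the toy objective
    alpha_c, alpha_a = 0.05, 0.01    # critic / actor step sizes

    theta = np.zeros(2)              # critic: J(x; theta) = theta[0] + theta[1] * x**2
    psi = np.array([0.0, -1.0])      # actor: a ~ N(psi[0] * x, exp(psi[1])**2)

    def value(x, th):
        return th[0] + th[1] * x**2

    def value_grad(x, th):
        return np.array([1.0, x**2])

    def log_pi_grad(a, x, mean, std):
        # gradient of log N(a; mean, std^2) w.r.t. (psi[0], psi[1]),
        # where mean = psi[0] * x and std = exp(psi[1])
        z = (a - mean) / std
        return np.array([z / std * x, z**2 - 1.0])

    x = 1.0
    for _ in range(1000):
        mean, std = psi[0] * x, np.exp(psi[1])
        a = rng.normal(mean, std)                                 # sample from the stochastic policy
        x_next = x + a * dt + 0.2 * np.sqrt(dt) * rng.normal()    # toy controlled diffusion step
        r = -(x**2 + a**2)                                        # toy running reward
        log_pi = -0.5 * ((a - mean) / std) ** 2 - np.log(std) - 0.5 * np.log(2 * np.pi)
        # TD-style increment: zero conditional mean at the correct critic,
        # mirroring the martingale condition used for policy evaluation
        delta = (value(x_next, theta) - value(x, theta)
                 + (r - lam * log_pi - beta * value(x, theta)) * dt)
        theta = theta + alpha_c * value_grad(x, theta) * delta        # critic update
        psi = psi + alpha_a * log_pi_grad(a, x, mean, std) * delta    # actor (policy gradient) update
        x = x_next

In the paper's setting the value function also depends on time and the updates come from martingale orthogonality conditions with general test functions; the loop above uses the simplest such choice (the gradients of the critic and of the log-policy) purely for illustration.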
Related Papers
50 records in total (10 listed below)
  • [1] Policy Gradient and Actor–Critic Learning in Continuous Time and Space: Theory and Algorithms
    Jia, Yanwei
    Zhou, Xun Yu
    Journal of Machine Learning Research, 2022, 23
  • [2] Bayesian Policy Gradient and Actor-Critic Algorithms
    Ghavamzadeh, Mohammad
    Engel, Yaakov
    Valko, Michal
    Journal of Machine Learning Research, 2016, 17
  • [3] Policy-Gradient Based Actor-Critic Algorithms
    Awate, Yogesh P.
    Proceedings of the 2009 WRI Global Congress on Intelligent Systems, Vol. III, 2009: 505-509
  • [4] Beyond the Policy Gradient Theorem for Efficient Policy Updates in Actor-Critic Algorithms
    Laroche, Romain
    des Combes, Remi Tachet
    International Conference on Artificial Intelligence and Statistics, Vol. 151, 2022: 5658-5688
  • [5] Actor-critic algorithms
    Konda, Vijay R.
    Tsitsiklis, John N.
    Advances in Neural Information Processing Systems 12, 2000: 1008-1014
  • [6] On actor-critic algorithms
    Konda, Vijay R.
    Tsitsiklis, John N.
    SIAM Journal on Control and Optimization, 2003, 42(4): 1143-1166
  • [7] Algorithms for Variance Reduction in a Policy-Gradient Based Actor-Critic Framework
    Awate, Yogesh P.
    ADPRL: 2009 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning, 2009: 130-136
  • [8] Characterizing the Gap Between Actor-Critic and Policy Gradient
    Wen, Junfeng
    Kumar, Saurabh
    Gummadi, Ramki
    Schuurmans, Dale
    International Conference on Machine Learning, Vol. 139, 2021
  • [9] A Robust Approach for Continuous Interactive Actor-Critic Algorithms
    Millan-Arias, Cristian C.
    Fernandes, Bruno J. T.
    Cruz, Francisco
    Dazeley, Richard
    Fernandes, Sergio
    IEEE Access, 2021, 9: 104242-104260
  • [10] Actor-Critic Learning Control With Regularization and Feature Selection in Policy Gradient Estimation
    Li, Luntong
    Li, Dazi
    Song, Tianheng
    Xu, Xin
    IEEE Transactions on Neural Networks and Learning Systems, 2021, 32(3): 1217-1227