Policy Gradient and Actor-Critic Learning in Continuous Time and Space: Theory and Algorithms

Cited: 0
Authors
Jia, Yanwei [1]
Zhou, Xun Yu [1,2]
Affiliations
[1] Columbia Univ, Dept Ind Engn & Operat Res, New York, NY 10027 USA
[2] Columbia Univ, Data Sci Inst, New York, NY 10027 USA
Keywords
reinforcement learning; continuous time and space; policy gradient; policy evaluation; actor-critic algorithms; martingale; ERGODIC CONTROL; MULTIDIMENSIONAL DIFFUSIONS; PORTFOLIO SELECTION;
DOI
Not available
Chinese Library Classification (CLC)
TP [Automation Technology, Computer Technology];
Discipline Classification Code
0812;
Abstract
We study policy gradient (PG) for reinforcement learning in continuous time and space under the regularized exploratory formulation developed by Wang et al. (2020). We represent the gradient of the value function with respect to a given parameterized stochastic policy as the expected integral of an auxiliary running reward function that can be evaluated using samples and the current value function. This representation effectively turns PG into a policy evaluation (PE) problem, enabling us to apply the martingale approach recently developed by Jia and Zhou (2022a) for PE to solve our PG problem. Based on this analysis, we propose two types of actor-critic algorithms for RL, in which value functions and policies are learned and updated simultaneously and alternately. The first type is based directly on the aforementioned representation, which involves future trajectories and is therefore offline. The second type, designed for online learning, employs the first-order condition of the policy gradient and turns it into martingale orthogonality conditions, which are then enforced via stochastic approximation when updating policies. Finally, we demonstrate the algorithms with simulations in two concrete examples.
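To make the actor-critic structure described in the abstract concrete, the following is a minimal, illustrative sketch only, not the authors' algorithm. It assumes a toy one-dimensional linear-quadratic problem, a quadratic critic V_psi, a Gaussian policy pi_theta, and arbitrary step sizes, all of which are assumptions introduced here; a discrete-time temporal-difference increment is used as a simple stand-in for the exact continuous-time martingale conditions imposed in the paper.

import numpy as np

rng = np.random.default_rng(0)

# Toy problem (an assumption, not from the paper): dX_t = a_t dt + sigma dW_t,
# running reward r(x, a) = -(x^2 + a^2)/2, discount rate beta,
# entropy regularization with temperature gamma_temp.
sigma, beta, gamma_temp = 1.0, 0.1, 0.1
dt, T, n_episodes = 0.01, 5.0, 200

# Critic: quadratic value function V_psi(x) = psi[0] + psi[1] * x^2 (assumed form).
psi = np.zeros(2)
def V(x, psi):
    return psi[0] + psi[1] * x**2
def grad_V(x):
    return np.array([1.0, x**2])

# Actor: Gaussian policy a ~ N(theta[0] * x, exp(theta[1])) (assumed form).
theta = np.array([0.0, 0.0])
def sample_action(x, theta):
    mean, var = theta[0] * x, np.exp(theta[1])
    return mean + np.sqrt(var) * rng.standard_normal()
def score(x, a, theta):
    # Gradient of log pi_theta(a | x) with respect to theta.
    mean, var = theta[0] * x, np.exp(theta[1])
    return np.array([(a - mean) / var * x,
                     0.5 * ((a - mean)**2 / var - 1.0)])
def entropy(theta):
    # Differential entropy of the Gaussian policy (independent of x here).
    return 0.5 * (np.log(2.0 * np.pi * np.exp(theta[1])) + 1.0)

alpha_critic, alpha_actor = 0.1, 0.01   # step sizes (assumptions)

for _ in range(n_episodes):
    x = rng.standard_normal()
    for _ in range(int(T / dt)):
        a = sample_action(x, theta)
        r = -0.5 * (x**2 + a**2)
        x_next = x + a * dt + sigma * np.sqrt(dt) * rng.standard_normal()

        # Discrete-time temporal-difference increment: a crude surrogate for the
        # continuous-time martingale conditions used in the paper.
        delta = (r + gamma_temp * entropy(theta)) * dt \
                + np.exp(-beta * dt) * V(x_next, psi) - V(x, psi)

        psi = psi + alpha_critic * delta * grad_V(x)               # critic (PE) step
        theta = theta + alpha_actor * delta * score(x, a, theta)   # actor (PG) step
        x = x_next

print("critic parameters:", psi)
print("actor parameters:", theta)

In this toy setting one would expect theta[0] to drift toward a negative value (a stabilizing feedback gain) and psi[1] toward a negative value, consistent with the quadratic cost, but the sketch makes no claim about the convergence guarantees established in the paper.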
Pages: 50