Policy Gradient and Actor-Critic Learning in Continuous Time and Space: Theory and Algorithms

Cited: 0
Authors
Jia, Yanwei [1]
Zhou, Xun Yu [1,2]
Affiliations
[1] Columbia Univ, Dept Ind Engn & Operat Res, New York, NY 10027 USA
[2] Columbia Univ, Data Sci Inst, New York, NY 10027 USA
Keywords
reinforcement learning; continuous time and space; policy gradient; policy evaluation; actor-critic algorithms; martingale; ERGODIC CONTROL; MULTIDIMENSIONAL DIFFUSIONS; PORTFOLIO SELECTION;
DOI
Not available
Chinese Library Classification (CLC)
TP [Automation Technology, Computer Technology];
Discipline Classification Code
0812;
Abstract
We study policy gradient (PG) for reinforcement learning in continuous time and space under the regularized exploratory formulation developed by Wang et al. (2020). We represent the gradient of the value function with respect to a given parameterized stochastic policy as the expected integral of an auxiliary running reward function that can be evaluated using samples and the current value function. This representation effectively turns PG into a policy evaluation (PE) problem, enabling us to apply the martingale approach recently developed by Jia and Zhou (2022a) for PE to solve our PG problem. Based on this analysis, we propose two types of actor-critic algorithms for RL, in which value functions and policies are learned and updated simultaneously and alternately. The first type is based directly on the aforementioned representation, which involves future trajectories and is therefore offline. The second type, designed for online learning, employs the first-order condition of the policy gradient and turns it into martingale orthogonality conditions, which are then enforced via stochastic approximation when updating policies. Finally, we demonstrate the algorithms with simulations in two concrete examples.
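To make the actor-critic structure described in the abstract concrete, the following is a minimal, illustrative sketch only, not the authors' algorithm. It assumes a toy one-dimensional linear-quadratic problem, a quadratic critic V_psi, a Gaussian policy pi_theta, and arbitrary step sizes, all of which are assumptions introduced here; a discrete-time temporal-difference increment is used as a simple stand-in for the exact continuous-time martingale conditions imposed in the paper.

import numpy as np

rng = np.random.default_rng(0)

# Toy problem (an assumption, not from the paper): dX_t = a_t dt + sigma dW_t,
# running reward r(x, a) = -(x^2 + a^2)/2, discount rate beta,
# entropy regularization with temperature gamma_temp.
sigma, beta, gamma_temp = 1.0, 0.1, 0.1
dt, T, n_episodes = 0.01, 5.0, 200

# Critic: quadratic value function V_psi(x) = psi[0] + psi[1] * x^2 (assumed form).
psi = np.zeros(2)
def V(x, psi):
    return psi[0] + psi[1] * x**2
def grad_V(x):
    return np.array([1.0, x**2])

# Actor: Gaussian policy a ~ N(theta[0] * x, exp(theta[1])) (assumed form).
theta = np.array([0.0, 0.0])
def sample_action(x, theta):
    mean, var = theta[0] * x, np.exp(theta[1])
    return mean + np.sqrt(var) * rng.standard_normal()
def score(x, a, theta):
    # Gradient of log pi_theta(a | x) with respect to theta.
    mean, var = theta[0] * x, np.exp(theta[1])
    return np.array([(a - mean) / var * x,
                     0.5 * ((a - mean)**2 / var - 1.0)])
def entropy(theta):
    # Differential entropy of the Gaussian policy (independent of x here).
    return 0.5 * (np.log(2.0 * np.pi * np.exp(theta[1])) + 1.0)

alpha_critic, alpha_actor = 0.1, 0.01   # step sizes (assumptions)

for _ in range(n_episodes):
    x = rng.standard_normal()
    for _ in range(int(T / dt)):
        a = sample_action(x, theta)
        r = -0.5 * (x**2 + a**2)
        x_next = x + a * dt + sigma * np.sqrt(dt) * rng.standard_normal()

        # Discrete-time temporal-difference increment: a crude surrogate for the
        # continuous-time martingale conditions used in the paper.
        delta = (r + gamma_temp * entropy(theta)) * dt \
                + np.exp(-beta * dt) * V(x_next, psi) - V(x, psi)

        psi = psi + alpha_critic * delta * grad_V(x)               # critic (PE) step
        theta = theta + alpha_actor * delta * score(x, a, theta)   # actor (PG) step
        x = x_next

print("critic parameters:", psi)
print("actor parameters:", theta)

In this toy setting one would expect theta[0] to drift toward a negative value (a stabilizing feedback gain) and psi[1] toward a negative value, consistent with the quadratic cost, but the sketch makes no claim about the convergence guarantees established in the paper.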
Pages: 50