Policy Gradient and Actor-Critic Learning in Continuous Time and Space: Theory and Algorithms

Cited by: 0
Authors
Jia, Yanwei [1 ]
Zhou, Xun Yu [1 ,2 ]
Affiliations
[1] Columbia Univ, Dept Ind Engn & Operat Res, New York, NY 10027 USA
[2] Columbia Univ, Data Sci Inst, New York, NY 10027 USA
Keywords
reinforcement learning; continuous time and space; policy gradient; policy evaluation; actor-critic algorithms; martingale; ergodic control; multidimensional diffusions; portfolio selection
DOI
Not available
Chinese Library Classification
TP [automation technology, computer technology]
Discipline code
0812
Abstract
We study policy gradient (PG) for reinforcement learning in continuous time and space under the regularized exploratory formulation developed by Wang et al. (2020). We represent the gradient of the value function with respect to a given parameterized stochastic policy as the expected integration of an auxiliary running reward function that can be evaluated using samples and the current value function. This representation effectively turns PG into a policy evaluation (PE) problem, enabling us to apply the martingale approach recently developed by Jia and Zhou (2022a) for PE to solve our PG problem. Based on this analysis, we propose two types of actor-critic algorithms for RL, where we learn and update value functions and policies simultaneously and alternatingly. The first type is based directly on the aforementioned representation, which involves future trajectories and is offline. The second type, designed for online learning, employs the first-order condition of the policy gradient and turns it into martingale orthogonality conditions. These conditions are then incorporated using stochastic approximation when updating policies. Finally, we demonstrate the algorithms by simulations in two concrete examples.
Pages: 50
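To make the online ("second type") scheme in the abstract concrete, here is a minimal Python sketch of one discretized actor-critic loop under stated assumptions: a toy scalar controlled diffusion, a quadratic critic, a Gaussian policy, and ad-hoc constants (temperature lam, discount beta, step sizes). None of these choices come from the paper; the sketch only illustrates how a common TD-style increment can drive both the critic update (the policy-evaluation, i.e. martingale, condition) and the actor update (the policy-gradient step).

    # Minimal, illustrative sketch (NOT the paper's exact algorithm) of an online
    # actor-critic update in discretized continuous time with entropy regularization.
    # All parameterizations, dynamics, and constants below are hypothetical.
    import numpy as np

    rng = np.random.default_rng(0)
    dt = 0.01                        # time-discretization step
    lam = 0.1                        # exploration temperature (entropy regularization)
    beta = 0.5                       # discount rate of the toy objective
    alpha_c, alpha_a = 0.05, 0.01    # critic / actor step sizes

    theta = np.zeros(2)              # critic: J(x; theta) = theta[0] + theta[1] * x**2
    psi = np.array([0.0, -1.0])      # actor: a ~ N(psi[0] * x, exp(psi[1])**2)

    def value(x, th):
        return th[0] + th[1] * x**2

    def value_grad(x, th):
        return np.array([1.0, x**2])

    def log_pi_grad(a, x, mean, std):
        # gradient of log N(a; mean, std^2) w.r.t. (psi[0], psi[1]),
        # where mean = psi[0] * x and std = exp(psi[1])
        z = (a - mean) / std
        return np.array([z / std * x, z**2 - 1.0])

    x = 1.0
    for _ in range(1000):
        mean, std = psi[0] * x, np.exp(psi[1])
        a = rng.normal(mean, std)                                 # sample from the stochastic policy
        x_next = x + a * dt + 0.2 * np.sqrt(dt) * rng.normal()    # toy controlled diffusion step
        r = -(x**2 + a**2)                                        # toy running reward
        log_pi = -0.5 * ((a - mean) / std) ** 2 - np.log(std) - 0.5 * np.log(2 * np.pi)
        # TD-style increment: zero conditional mean at the correct critic,
        # mirroring the martingale condition used for policy evaluation
        delta = (value(x_next, theta) - value(x, theta)
                 + (r - lam * log_pi - beta * value(x, theta)) * dt)
        theta = theta + alpha_c * value_grad(x, theta) * delta        # critic update
        psi = psi + alpha_a * log_pi_grad(a, x, mean, std) * delta    # actor (policy gradient) update
        x = x_next

In the paper's setting the value function also depends on time and the updates come from martingale orthogonality conditions with general test functions; the loop above uses the simplest such choice (the gradients of the critic and of the log-policy) purely for illustration.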
Related Papers
50 records in total (10 listed below)
  • [1] Policy Gradient and Actor–Critic Learning in Continuous Time and Space: Theory and Algorithms
    Jia, Yanwei
    Zhou, Xun Yu
    Journal of Machine Learning Research, 2022, 23
  • [2] Bayesian Policy Gradient and Actor-Critic Algorithms
    Ghavamzadeh, Mohammad
    Engel, Yaakov
    Valko, Michal
    Journal of Machine Learning Research, 2016, 17
  • [3] Policy-Gradient Based Actor-Critic Algorithms
    Awate, Yogesh P.
    Proceedings of the 2009 WRI Global Congress on Intelligent Systems, Vol. III, 2009: 505-509
  • [4] Beyond the Policy Gradient Theorem for Efficient Policy Updates in Actor-Critic Algorithms
    Laroche, Romain
    des Combes, Remi Tachet
    International Conference on Artificial Intelligence and Statistics, Vol. 151, 2022: 5658-5688
  • [5] Actor-critic algorithms
    Konda, Vijay R.
    Tsitsiklis, John N.
    Advances in Neural Information Processing Systems 12, 2000: 1008-1014
  • [6] On actor-critic algorithms
    Konda, Vijay R.
    Tsitsiklis, John N.
    SIAM Journal on Control and Optimization, 2003, 42(4): 1143-1166
  • [7] Algorithms for Variance Reduction in a Policy-Gradient Based Actor-Critic Framework
    Awate, Yogesh P.
    ADPRL: 2009 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning, 2009: 130-136
  • [8] Characterizing the Gap Between Actor-Critic and Policy Gradient
    Wen, Junfeng
    Kumar, Saurabh
    Gummadi, Ramki
    Schuurmans, Dale
    International Conference on Machine Learning, Vol. 139, 2021
  • [9] A Robust Approach for Continuous Interactive Actor-Critic Algorithms
    Millan-Arias, Cristian C.
    Fernandes, Bruno J. T.
    Cruz, Francisco
    Dazeley, Richard
    Fernandes, Sergio
    IEEE Access, 2021, 9: 104242-104260
  • [10] Actor-Critic Learning Control With Regularization and Feature Selection in Policy Gradient Estimation
    Li, Luntong
    Li, Dazi
    Song, Tianheng
    Xu, Xin
    IEEE Transactions on Neural Networks and Learning Systems, 2021, 32(3): 1217-1227