Reinforcement learning in continuous time and space

Cited by: 586
Authors: Doya, K [1]
Affiliation: [1] ATR Human Informat Proc Res Labs, Kyoto 6190288, Japan
DOI: 10.1162/089976600300015961
CLC classification: TP18 [Artificial intelligence theory]
Discipline codes: 081104; 0812; 0835; 1405
Abstract
This article presents a reinforcement learning framework for continuous-time dynamical systems without a priori discretization of time, state, and action. Based on the Hamilton-Jacobi-Bellman (HJB) equation for infinite-horizon, discounted reward problems, we derive algorithms for estimating value functions and improving policies with the use of function approximators. The process of value function estimation is formulated as the minimization of a continuous-time form of the temporal difference (TD) error. Update methods based on backward Euler approximation and exponential eligibility traces are derived, and their correspondences with the conventional residual gradient, TD(0), and TD(lambda) algorithms are shown. For policy improvement, two methods are formulated: a continuous actor-critic method and a value-gradient-based greedy policy. As a special case of the latter, a nonlinear feedback control law using the value gradient and the model of the input gain is derived. Advantage updating, a model-free algorithm derived previously, is also formulated in the HJB-based framework. The performance of the proposed algorithms is first tested in a nonlinear control task of swinging up a pendulum with limited torque. The simulations show that (1) the continuous actor-critic method accomplishes the task in several times fewer trials than the conventional discrete actor-critic method; (2) among the continuous policy update methods, the value-gradient-based policy with a known or learned dynamic model performs several times better than the actor-critic method; and (3) a value function update using exponential eligibility traces is more efficient and stable than one based on Euler approximation. The algorithms are then tested on a higher-dimensional task, cart-pole swing-up. This task is accomplished in several hundred trials using the value-gradient-based policy with a learned dynamic model.
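The continuous-time TD error and exponential eligibility traces described in the abstract can be sketched in a few lines. The sketch below is illustrative only: the toy dynamics (a damped scalar state under a fixed policy), the radial-basis features, and all hyperparameter values are my own assumptions, not the paper's pendulum setup. It estimates the TD error delta = r - V/tau + dV/dt with a finite-difference derivative and decays the trace with time constant kappa.

```python
import numpy as np

np.random.seed(0)

# Hyperparameters (illustrative choices, not the paper's):
dt = 0.02     # Euler time step
tau = 1.0     # reward discount time constant
kappa = 0.2   # eligibility-trace time constant
eta = 0.1     # learning rate

# Toy dynamics: damped scalar state dx/dt = -x under a fixed policy,
# with instantaneous reward r(x) = -x^2 (true value V*(x) = -x^2 / 3).
def step(x):
    return x + dt * (-x)

# Gaussian radial-basis features over x in [-2, 2] as the approximator.
centers = np.linspace(-2.0, 2.0, 11)

def phi(x):
    return np.exp(-((x - centers) ** 2) / 0.2)

w = np.zeros_like(centers)
td_per_trial = []
for trial in range(300):
    x = np.random.uniform(-2.0, 2.0)
    e = np.zeros_like(w)              # eligibility trace
    total_td = 0.0
    for _ in range(100):
        x_new = step(x)
        V, V_new = w @ phi(x), w @ phi(x_new)
        # Continuous-time TD error: delta = r - V/tau + dV/dt,
        # with dV/dt estimated by a finite difference (Euler).
        delta = -x ** 2 - V / tau + (V_new - V) / dt
        # Exponentially decaying trace: de/dt = -e/kappa + grad_w V.
        e = np.exp(-dt / kappa) * e + phi(x) * dt
        w += eta * delta * e * dt     # discretized w-dot = eta * delta * e
        total_td += abs(delta)
        x = x_new
    td_per_trial.append(total_td)

print(np.mean(td_per_trial[:30]), np.mean(td_per_trial[-30:]))
```

Run on this toy system, the per-trial TD error magnitude shrinks as the learned value function approaches the analytic value V*(x) = -x^2 / 3; the value-gradient-based policies in the paper would then use the derivative of this learned V to choose actions.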
Pages: 219-245 (27 pages)
Related papers (50 in total)
  • [1] Barycentric interpolators for continuous space & time reinforcement learning
    Munos, R
    Moore, A
    [J]. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 11, 1999, 11: 1024-1030
  • [2] Linear inverse reinforcement learning in continuous time and space
    Kamalapurkar, Rushikesh
    [J]. 2018 ANNUAL AMERICAN CONTROL CONFERENCE (ACC), 2018: 1683-1688
  • [3] Reinforcement Learning in Continuous Time and Space: A Stochastic Control Approach
    Wang, Haoran
    Zariphopoulou, Thaleia
    Zhou, Xun Yu
    [J]. JOURNAL OF MACHINE LEARNING RESEARCH, 2020, 21
  • [4] Budgeted Reinforcement Learning in Continuous State Space
    Carrara, Nicolas
    Leurent, Edouard
    Laroche, Romain
    Urvoy, Tanguy
    Maillard, Odalric-Ambrym
    Pietquin, Olivier
    [J]. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 32 (NIPS 2019), 2019, 32
  • [5] On Applications of Bootstrap in Continuous Space Reinforcement Learning
    Faradonbeh, Mohamad Kazem Shirani
    Tewari, Ambuj
    Michailidis, George
    [J]. 2019 IEEE 58TH CONFERENCE ON DECISION AND CONTROL (CDC), 2019: 1977-1984
  • [6] Switching reinforcement learning for continuous action space
    Nagayoshi, Masato
    Murao, Hajime
    Tamaki, Hisashi
    [J]. ELECTRONICS AND COMMUNICATIONS IN JAPAN, 2012, 95 (03): 37-44
  • [7] Policy iterations for reinforcement learning problems in continuous time and space - Fundamental theory and methods
    Lee, Jaeyoung
    Sutton, Richard S.
    [J]. AUTOMATICA, 2021, 126
  • [8] Hierarchical Reinforcement Learning Based on Continuous Subgoal Space
    Wang, Chen
    Zeng, Fanyu
    Ge, Shuzhi Sam
    Jiang, Xin
    [J]. 2020 IEEE INTERNATIONAL CONFERENCE ON REAL-TIME COMPUTING AND ROBOTICS (IEEE-RCAR 2020), 2020: 74-80
  • [9] A reinforcement learning with switching controllers for a continuous action space
    Nagayoshi, Masato
    Murao, Hajime
    Tamaki, Hisashi
    [J]. ARTIFICIAL LIFE AND ROBOTICS, 2010, 15 (01): 97-100