Deep reinforcement learning using least-squares truncated temporal-difference

Cited by: 3
Authors
Ren, Junkai [1 ]
Lan, Yixing [1 ]
Xu, Xin [1 ]
Zhang, Yichuan [2 ]
Fang, Qiang [1 ]
Zeng, Yujun [1 ]
Affiliations
[1] Natl Univ Def Technol, Coll Intelligence Sci & Technol, Changsha, Peoples R China
[2] Xian Satellite Control Ctr, State Key Lab Astronaut Dynam, Xian, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Deep reinforcement learning; policy evaluation; temporal difference; value function approximation;
DOI
10.1049/cit2.12202
CLC Number
TP18 [Theory of Artificial Intelligence];
Discipline Codes
081104; 0812; 0835; 1405;
Abstract
Policy evaluation (PE) is a critical sub-problem in reinforcement learning: it estimates the value function of a given policy and can be used for policy improvement. However, current PE methods still suffer from limitations such as low sample efficiency and local convergence, especially on complex tasks. In this study, a novel PE algorithm called Least-Squares Truncated Temporal-Difference learning (LST²D) is proposed. In LST²D, an adaptive truncation mechanism is designed that effectively combines the fast convergence property of Least-Squares Temporal Difference learning (LSTD) with the asymptotic convergence property of Temporal Difference learning (TD). Then, two feature pre-training methods are utilised to improve the approximation ability of LST²D. Furthermore, an Actor-Critic algorithm based on LST²D and pre-trained feature representations (ACLPF) is proposed, where LST²D is integrated into the critic network to improve learning-prediction efficiency. Comprehensive simulation studies were conducted on four robotic tasks, and the corresponding results illustrate the effectiveness of LST²D. The proposed ACLPF algorithm outperformed DQN, ACER and PPO in terms of sample efficiency and stability, which demonstrates that LST²D can be applied to online learning control problems by incorporating it into the actor-critic architecture.
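The abstract contrasts the two policy-evaluation methods that LST²D combines: batch LSTD, which converges fast by solving a linear system in one shot, and incremental TD(0), which converges asymptotically via stochastic updates. The sketch below shows both in their standard textbook form with linear features on a tiny two-state chain; it does not reproduce the paper's adaptive truncation mechanism, whose details are not given in this record.

```python
import numpy as np

def lstd(transitions, phi, gamma=0.9, reg=1e-6):
    """Batch LSTD: solve A w = b, where
    A = sum phi(s)(phi(s) - gamma*phi(s'))^T and b = sum r*phi(s)."""
    d = phi(transitions[0][0]).shape[0]
    A = reg * np.eye(d)          # small ridge term keeps A invertible
    b = np.zeros(d)
    for s, r, s_next in transitions:
        f, f_next = phi(s), phi(s_next)
        A += np.outer(f, f - gamma * f_next)
        b += r * f
    return np.linalg.solve(A, b)

def td0(transitions, phi, gamma=0.9, alpha=0.1, sweeps=2000):
    """Incremental TD(0): w += alpha * delta * phi(s)."""
    d = phi(transitions[0][0]).shape[0]
    w = np.zeros(d)
    for _ in range(sweeps):
        for s, r, s_next in transitions:
            delta = r + gamma * phi(s_next) @ w - phi(s) @ w
            w += alpha * delta * phi(s)
    return w

# Two-state chain: 0 -> 1 (r=0), 1 -> 1 (r=1); at gamma=0.9 the true
# values are V(1) = 1/(1-0.9) = 10 and V(0) = 0.9 * 10 = 9.
phi = lambda s: np.eye(2)[s]     # one-hot features
data = [(0, 0.0, 1), (1, 1.0, 1)]
w_lstd = lstd(data, phi)         # one batch linear solve
w_td = td0(data, phi)            # many incremental sweeps
```

Both estimators recover V ≈ [9, 10]; LSTD needs a single pass plus one linear solve, while TD(0) needs many sweeps, which is the sample-efficiency gap the truncation mechanism is designed to exploit.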
Pages: 425-439
Page count: 15
Related Papers
50 records in total
  • [31] Temporal-difference learning and applications in finance
    Van Roy, B
    COMPUTATIONAL FINANCE 1999, 2000, : 447 - 461
  • [32] Average cost temporal-difference learning
    Tsitsiklis, JN
    Van Roy, B
    PROCEEDINGS OF THE 36TH IEEE CONFERENCE ON DECISION AND CONTROL, VOLS 1-5, 1997, : 498 - 502
  • [33] Eigensubspace of Temporal-Difference Dynamics and How It Improves Value Approximation in Reinforcement Learning
    He, Qiang
    Zhou, Tianyi
    Fang, Meng
    Maghsudi, Setareh
    MACHINE LEARNING AND KNOWLEDGE DISCOVERY IN DATABASES: RESEARCH TRACK, ECML PKDD 2023, PT IV, 2023, 14172 : 573 - 589
  • [34] True Online Temporal-Difference Learning
    van Seijen, Harm
    Mahmood, A. Rupam
    Pilarski, Patrick M.
    Machado, Marlos C.
    Sutton, Richard S.
    JOURNAL OF MACHINE LEARNING RESEARCH, 2016, 17
  • [35] Average cost temporal-difference learning
    Tsitsiklis, JN
    Van Roy, B
    AUTOMATICA, 1999, 35 (11) : 1799 - 1808
  • [37] An Analysis of Quantile Temporal-Difference Learning
    Rowland, Mark
    Munos, Remi
    Azar, Mohammad Gheshlaghi
    Tang, Yunhao
    Ostrovski, Georg
    Harutyunyan, Anna
    Tuyls, Karl
    Bellemare, Marc G.
    Dabney, Will
    JOURNAL OF MACHINE LEARNING RESEARCH, 2024, 25
  • [38] Reinforcement Learning for Dialog Management using Least-Squares Policy Iteration and Fast Feature Selection
    Li, Lihong
    Williams, Jason D.
    Balakrishnan, Suhrid
INTERSPEECH 2009: 10TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2009, VOLS 1-5, 2009, : 2447+
  • [39] A Least-Squares Temporal Difference based method for solving resource allocation problems
    Forootani, Ali
    Tipaldi, Massimo
    Zarch, Majid Ghaniee
    Liuzza, Davide
    Glielmo, Luigi
    IFAC JOURNAL OF SYSTEMS AND CONTROL, 2020, 13
  • [40] Faster SVD-Truncated Regularized Least-Squares
    Boutsidis, Christos
    Magdon-Ismail, Malik
    2014 IEEE INTERNATIONAL SYMPOSIUM ON INFORMATION THEORY (ISIT), 2014, : 1321 - 1325