Deep reinforcement learning using least-squares truncated temporal-difference

Times Cited: 3
Authors
Ren, Junkai [1 ]
Lan, Yixing [1 ]
Xu, Xin [1 ]
Zhang, Yichuan [2 ]
Fang, Qiang [1 ]
Zeng, Yujun [1 ]
Affiliations
[1] Natl Univ Def Technol, Coll Intelligence Sci & Technol, Changsha, Peoples R China
[2] Xian Satellite Control Ctr, State Key Lab Astronaut Dynam, Xian, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Deep reinforcement learning; policy evaluation; temporal difference; value function approximation;
DOI
10.1049/cit2.12202
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Policy evaluation (PE) is a critical sub-problem in reinforcement learning: it estimates the value function of a given policy and can be used for policy improvement. However, current PE methods still suffer from limitations such as low sample efficiency and local convergence, especially on complex tasks. In this study, a novel PE algorithm called Least-Squares Truncated Temporal-Difference learning (LST²D) is proposed. In LST²D, an adaptive truncation mechanism is designed that effectively combines the fast convergence of Least-Squares Temporal Difference learning (LSTD) with the asymptotic convergence of Temporal Difference learning (TD). Two feature pre-training methods are then utilised to improve the approximation ability of LST²D. Furthermore, an Actor-Critic algorithm based on LST²D and pre-trained feature representations (ACLPF) is proposed, in which LST²D is integrated into the critic network to improve learning-prediction efficiency. Comprehensive simulation studies were conducted on four robotic tasks, and the corresponding results illustrate the effectiveness of LST²D. The proposed ACLPF algorithm outperformed DQN, ACER and PPO in terms of sample efficiency and stability, demonstrating that LST²D can be applied to online learning control problems by incorporating it into the actor-critic architecture.
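The abstract contrasts batch least-squares temporal-difference learning, which converges quickly from a fixed batch of data, with incremental temporal-difference learning, which converges asymptotically; LST²D adaptively combines the two. The minimal Python sketch below shows these two building blocks for a linear value function V(s) = phi(s)·w. The feature map feature_fn, the transition tuples and the regularisation constant are illustrative assumptions, not the authors' LST²D implementation.

import numpy as np

def lstd_weights(transitions, feature_fn, gamma=0.99, reg=1e-6):
    # Batch least-squares TD: accumulate A and b over transitions
    # (s, r, s_next, done) and solve A w = b in closed form.
    d = feature_fn(transitions[0][0]).shape[0]
    A = reg * np.eye(d)          # small ridge term keeps A invertible
    b = np.zeros(d)
    for s, r, s_next, done in transitions:
        phi = feature_fn(s)
        phi_next = np.zeros(d) if done else feature_fn(s_next)
        A += np.outer(phi, phi - gamma * phi_next)
        b += r * phi
    return np.linalg.solve(A, b)

def td0_update(w, s, r, s_next, done, feature_fn, gamma=0.99, alpha=0.05):
    # One incremental TD(0) step on the same linear value function.
    phi = feature_fn(s)
    v_next = 0.0 if done else feature_fn(s_next) @ w
    delta = r + gamma * v_next - phi @ w     # TD error
    return w + alpha * delta * phi

How LST²D truncates experience and schedules the interplay between the batch solver and the incremental updates is specific to the paper and is not reproduced here.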
Pages: 425-439
Number of pages: 15
Related Papers
50 records in total
  • [11] Postponed Updates for Temporal-Difference Reinforcement Learning
    van Seijen, Harm
    Whiteson, Shimon
    2009 9TH INTERNATIONAL CONFERENCE ON INTELLIGENT SYSTEMS DESIGN AND APPLICATIONS, 2009, : 665 - +
  • [12] Efficient reinforcement learning using recursive least-squares methods
    Xu, X
    He, HG
    Hu, DW
    JOURNAL OF ARTIFICIAL INTELLIGENCE RESEARCH, 2002, 16 : 259 - 292
  • [13] An enhanced least-squares approach for reinforcement learning
    Li, HL
    Dagli, CH
    PROCEEDINGS OF THE INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS 2003, VOLS 1-4, 2003, : 2905 - 2909
  • [14] Least-Squares SARSA(λ) Algorithms for Reinforcement Learning
    Chen, Sheng-Lei
    Wei, Yan-Mei
    ICNC 2008: FOURTH INTERNATIONAL CONFERENCE ON NATURAL COMPUTATION, VOL 2, PROCEEDINGS, 2008, : 632 - +
  • [15] Least-squares methods in reinforcement learning for control
    Lagoudakis, MG
    Parr, R
    Littman, ML
    METHODS AND APPLICATIONS OF ARTIFICIAL INTELLIGENCE, 2002, 2308 : 249 - 260
  • [16] Hybrid Least-Squares methods for reinforcement learning
    Li, HL
    Dagli, CH
    DEVELOPMENTS IN APPLIED ARTIFICIAL INTELLIGENCE, 2003, 2718 : 471 - 480
  • [17] Recursive Least-Squares Temporal Difference With Gradient Correction
    Song, Tianheng
    Li, Dazi
    Yang, Weimin
    Hirasawa, Kotaro
    IEEE TRANSACTIONS ON CYBERNETICS, 2021, 51 (08) : 4251 - 4264
  • [18] Improving reinforcement learning using temporal-difference network
    Karbasian, Habib
    Ahmadabadi, Majid N.
    Araabi, Babak N.
    EUROCON 2009: INTERNATIONAL IEEE CONFERENCE DEVOTED TO THE 150 ANNIVERSARY OF ALEXANDER S. POPOV, VOLS 1-4, PROCEEDINGS, 2009, : 1716 - 1722
  • [19] Least-Squares Temporal Difference Learning with Eligibility Traces based on Regularized Extreme Learning Machine
    Li, Dazi
    Li, Luntong
    Song, Tianheng
    Jin, Qibing
    PROCEEDINGS OF THE 28TH CHINESE CONTROL AND DECISION CONFERENCE (2016 CCDC), 2016, : 6976 - 6981
  • [20] Correlation minimizing replay memory in temporal-difference reinforcement learning
    Ramicic, Mirza
    Bonarini, Andrea
    NEUROCOMPUTING, 2020, 393 : 91 - 100