Deep reinforcement learning using least-squares truncated temporal-difference

Cited by: 3
Authors
Ren, Junkai [1 ]
Lan, Yixing [1 ]
Xu, Xin [1 ]
Zhang, Yichuan [2 ]
Fang, Qiang [1 ]
Zeng, Yujun [1 ]
Affiliations
[1] Natl Univ Def Technol, Coll Intelligence Sci & Technol, Changsha, Peoples R China
[2] Xian Satellite Control Ctr, State Key Lab Astronaut Dynam, Xian, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Deep reinforcement learning; policy evaluation; temporal difference; value function approximation;
DOI
10.1049/cit2.12202
CLC Classification Number
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Policy evaluation (PE) is a critical sub-problem in reinforcement learning: it estimates the value function of a given policy and can then be used for policy improvement. However, current PE methods still have limitations, such as low sample efficiency and local convergence, especially on complex tasks. In this study, a novel PE algorithm called Least-Squares Truncated Temporal-Difference learning (LST2D) is proposed. In LST2D, an adaptive truncation mechanism is designed that effectively combines the fast convergence of Least-Squares Temporal-Difference learning (LSTD) with the asymptotic convergence of Temporal-Difference learning (TD). Two feature pre-training methods are then used to improve the approximation ability of LST2D. Furthermore, an Actor-Critic algorithm based on LST2D and pre-trained feature representations (ACLPF) is proposed, in which LST2D is integrated into the critic network to improve learning and prediction efficiency. Comprehensive simulation studies were conducted on four robotic tasks, and the results illustrate the effectiveness of LST2D. The proposed ACLPF algorithm outperformed DQN, ACER and PPO in terms of sample efficiency and stability, demonstrating that LST2D can be applied to online learning control problems by incorporating it into the actor-critic architecture.
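As background for the abstract above, the following is a minimal sketch of the two standard estimators it builds on, incremental TD(0) and batch LSTD policy evaluation with linear value-function approximation, written in Python. It is a generic illustration only, not the paper's LST2D truncation mechanism or its ACLPF architecture; the function names, the ridge regularisation term and the synthetic feature data are assumptions introduced here.

import numpy as np

def td0_update(w, phi_s, r, phi_next, gamma=0.99, alpha=0.01):
    """One incremental TD(0) step on linear weights w (asymptotically convergent)."""
    delta = r + gamma * phi_next @ w - phi_s @ w   # TD error
    return w + alpha * delta * phi_s

def lstd_solve(transitions, feat_dim, gamma=0.99, reg=1e-3):
    """Batch LSTD: solve A w = b with A = sum phi(s)(phi(s) - gamma*phi(s'))^T
    and b = sum r*phi(s).  The small ridge term 'reg' (an assumption here)
    keeps A well conditioned on short trajectories."""
    A = reg * np.eye(feat_dim)
    b = np.zeros(feat_dim)
    for phi_s, r, phi_next in transitions:
        A += np.outer(phi_s, phi_s - gamma * phi_next)
        b += r * phi_s
    return np.linalg.solve(A, b)

# Toy usage on a synthetic batch of (phi(s), r, phi(s')) transitions.
rng = np.random.default_rng(0)
d = 8
batch = [(rng.standard_normal(d), rng.standard_normal(), rng.standard_normal(d))
         for _ in range(200)]

w_lstd = lstd_solve(batch, d)          # one-shot least-squares estimate
w_td = np.zeros(d)
for phi_s, r, phi_next in batch:       # incremental TD(0) sweep
    w_td = td0_update(w_td, phi_s, r, phi_next)

LSTD fits the weights in a single least-squares solve over a batch, which is sample-efficient, while TD(0) updates them incrementally and converges asymptotically; the abstract describes the proposed adaptive truncation as combining the strengths of both.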
Pages: 425 - 439
Number of pages: 15
Related Papers
50 records in total
  • [1] Least-Squares temporal difference learning
    Boyan, JA
    MACHINE LEARNING, PROCEEDINGS, 1999: 49 - 56
  • [2] Technical Update: Least-Squares Temporal Difference Learning
    Boyan, JA
    MACHINE LEARNING, 2002, 49 (2-3) : 233 - 246
  • [4] A sparse kernel-based least-squares temporal difference algorithm for reinforcement learning
    Xu, Xin
    ADVANCES IN NATURAL COMPUTATION, PT 1, 2006, 4221 : 47 - 56
  • [6] Multikernel Recursive Least-Squares Temporal Difference Learning
    Zhang, Chunyuan
    Zhu, Qingxin
    Niu, Xinzheng
    INTELLIGENT COMPUTING METHODOLOGIES, ICIC 2016, PT III, 2016, 9773 : 205 - 217
  • [7] Linear least-squares algorithms for temporal difference learning
    Bradtke, SJ
    Barto, AG
    MACHINE LEARNING, 1996, 22 (1-3) : 33 - 57
  • [8] Least-squares temporal difference learning based on an extreme learning machine
    Escandell-Montero, Pablo
    Martinez-Martinez, Jose M.
    Martin-Guerrero, Jose D.
    Soria-Olivas, Emilio
    Gomez-Sanchis, Juan
    NEUROCOMPUTING, 2014, 141 : 37 - 45
  • [9] Least-Squares Temporal Difference Learning for the Linear Quadratic Regulator
    Tu, Stephen
    Recht, Benjamin
    INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 80, 2018
  • [10] Temporal-Difference Reinforcement Learning with Distributed Representations
    Kurth-Nelson, Zeb
    Redish, A. David
    PLOS ONE, 2009, 4 (10)