Deep reinforcement learning using least-squares truncated temporal-difference

Cited by: 3
Authors
Ren, Junkai [1]
Lan, Yixing [1]
Xu, Xin [1]
Zhang, Yichuan [2]
Fang, Qiang [1]
Zeng, Yujun [1]
Affiliations
[1] Natl Univ Def Technol, Coll Intelligence Sci & Technol, Changsha, Peoples R China
[2] Xian Satellite Control Ctr, State Key Lab Astronaut Dynam, Xian, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Deep reinforcement learning; policy evaluation; temporal difference; value function approximation
DOI
10.1049/cit2.12202
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Policy evaluation (PE) is a critical sub-problem in reinforcement learning: it estimates the value function of a given policy, which can then be used for policy improvement. However, current PE methods still have limitations, such as low sample efficiency and local convergence, especially on complex tasks. In this study, a novel PE algorithm called Least-Squares Truncated Temporal-Difference learning (LST²D) is proposed. In LST²D, an adaptive truncation mechanism is designed that takes advantage of both the fast convergence of Least-Squares Temporal Difference learning (LSTD) and the asymptotic convergence of Temporal Difference learning (TD). Two feature pre-training methods are then utilised to improve the approximation ability of LST²D. Furthermore, an Actor-Critic algorithm based on LST²D and pre-trained feature representations (ACLPF) is proposed, in which LST²D is integrated into the critic network to improve learning-prediction efficiency. Comprehensive simulation studies were conducted on four robotic tasks, and the results illustrate the effectiveness of LST²D. The proposed ACLPF algorithm outperformed DQN, ACER and PPO in terms of sample efficiency and stability, which demonstrates that LST²D can be applied to online learning control problems by incorporating it into the actor-critic architecture.
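For orientation, the two standard estimators that the abstract contrasts can be sketched for linear value-function approximation. This is a minimal illustration of textbook LSTD(0) and TD(0) only; it is not the paper's adaptive truncation mechanism, which the abstract does not specify. The function names, the three-state chain, the regularisation term `reg`, and the step size `alpha` are all assumptions made for the example.

```python
import numpy as np

def lstd(phi, phi_next, rewards, gamma=0.9, reg=1e-6):
    """Textbook LSTD(0): solve A w = b in one shot from a batch of transitions."""
    # A = Phi^T (Phi - gamma * Phi'),  b = Phi^T r
    # `reg` is an assumed small ridge term that keeps A invertible.
    A = phi.T @ (phi - gamma * phi_next) + reg * np.eye(phi.shape[1])
    b = phi.T @ rewards
    return np.linalg.solve(A, b)

def td0(phi, phi_next, rewards, gamma=0.9, alpha=0.05, sweeps=500):
    """Textbook TD(0): incremental updates driven by the TD error."""
    w = np.zeros(phi.shape[1])
    for _ in range(sweeps):
        for f, f_next, r in zip(phi, phi_next, rewards):
            delta = r + gamma * (f_next @ w) - (f @ w)  # TD error
            w += alpha * delta * f
    return w

# Hypothetical toy problem: chain 0 -> 1 -> 2, state 2 absorbing,
# reward 1 on the transition 1 -> 2, one-hot features.
phi = np.eye(3)                       # features of states 0, 1, 2
phi_next = np.eye(3)[[1, 2, 2]]       # features of the successor states
rewards = np.array([0.0, 1.0, 0.0])

w_lstd = lstd(phi, phi_next, rewards)
w_td = td0(phi, phi_next, rewards)
print("LSTD weights:", w_lstd)
print("TD(0) weights:", w_td)
```

On this toy chain both estimators approach the same value function; LSTD gets there with a single linear solve over the batch, while TD(0) needs many incremental sweeps, which is the trade-off the abstract refers to.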
Pages: 425-439
Page count: 15