Model-Free Trajectory-based Policy Optimization with Monotonic Improvement

Times Cited: 0
Authors
Akrour, Riad [1 ]
Abdolmaleki, Abbas [2 ]
Abdulsamad, Hany [1 ]
Peters, Jan [1 ,3 ]
Neumann, Gerhard [1 ,4 ]
Affiliations
[1] Tech Univ Darmstadt, CLAS IAS, Hsch Str 10, D-64289 Darmstadt, Germany
[2] DeepMind, London N1C 4AG, England
[3] Max Planck Inst Intelligent Syst, Max Planck Ring 4, Tübingen, Germany
[4] Univ Lincoln, L CAS, Lincoln LN6 7TS, England
Funding
European Union Horizon 2020;
Keywords
Reinforcement Learning; Policy Optimization; Trajectory Optimization; Robotics;
DOI
Not available
Chinese Library Classification (CLC)
TP [Automation and Computer Technology];
Discipline Code
0812;
Abstract
Many recent trajectory optimization algorithms alternate between linearly approximating the system dynamics around the mean trajectory and performing a conservative policy update. One way of constraining the policy change is to bound the Kullback-Leibler (KL) divergence between successive policies. These approaches have already demonstrated great experimental success on challenging problems such as end-to-end control of physical systems. However, the linear approximation of the system dynamics can introduce a bias into the policy update and prevent convergence to the optimal policy. In this article, we propose a new model-free trajectory-based policy optimization algorithm with guaranteed monotonic improvement. Instead of a model of the system dynamics, the algorithm backpropagates a local, quadratic, time-dependent Q-function learned from trajectory data. Our policy update ensures exact satisfaction of the KL constraint without simplifying assumptions on the system dynamics. On highly non-linear control tasks, we experimentally demonstrate that our algorithm improves on approaches that linearize the system dynamics. To establish the monotonic improvement of our algorithm, we additionally conduct a theoretical analysis of our policy update scheme and derive a lower bound on the change in policy return between successive iterations.
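
To make the kind of update described in the abstract concrete, below is a minimal NumPy sketch, not the paper's reference implementation: a time-dependent linear-Gaussian policy pi(a|s) = N(K s + k, Sigma) is re-weighted against a local quadratic Q-model via pi_new(a|s) ∝ pi_old(a|s) exp(Q(s,a)/eta), and the temperature eta is binary-searched so that the average KL to the old policy stays below a bound eps. The re-weighting and temperature search are standard ingredients of KL-constrained updates; all variable names (Qaa, Qsa, qa, eta, eps) are illustrative assumptions rather than the paper's notation.

# Sketch of one time step of a KL-constrained update of a linear-Gaussian
# policy against a quadratic Q-model
#     Q(s, a) = 0.5 a^T Qaa a + a^T Qsa s + a^T qa + const.
# Assumes Qaa is negative definite (Q concave in the action).
import numpy as np

def reweighted_policy(K, k, Sigma, Qaa, Qsa, qa, eta):
    """Gaussian policy proportional to pi_old(a|s) * exp(Q(s, a) / eta).

    Completing the square in the action: the new precision is
    Sigma^-1 - Qaa / eta, and the new mean stays linear in the state.
    """
    Sig_inv = np.linalg.inv(Sigma)
    Sigma_new = np.linalg.inv(Sig_inv - Qaa / eta)
    K_new = Sigma_new @ (Sig_inv @ K + Qsa / eta)
    k_new = Sigma_new @ (Sig_inv @ k + qa / eta)
    return K_new, k_new, Sigma_new

def expected_kl(states, Kn, kn, Sn, Ko, ko, So):
    """Average KL(pi_new || pi_old) over a batch of sampled states."""
    d = So.shape[0]
    So_inv = np.linalg.inv(So)
    _, ld_o = np.linalg.slogdet(So)
    _, ld_n = np.linalg.slogdet(Sn)
    kls = []
    for s in states:
        dm = (Kn @ s + kn) - (Ko @ s + ko)
        kls.append(0.5 * (np.trace(So_inv @ Sn) + dm @ So_inv @ dm
                          - d + ld_o - ld_n))
    return float(np.mean(kls))

def kl_constrained_update(states, K, k, Sigma, Qaa, Qsa, qa, eps=0.1):
    """Geometric binary search on eta: larger eta means a smaller step."""
    lo, hi = 1e-4, 1e6
    for _ in range(60):
        eta = np.sqrt(lo * hi)
        Kn, kn, Sn = reweighted_policy(K, k, Sigma, Qaa, Qsa, qa, eta)
        if expected_kl(states, Kn, kn, Sn, K, k, Sigma) > eps:
            lo = eta            # step too aggressive, raise the temperature
        else:
            hi = eta            # feasible, try a more aggressive step
    return reweighted_policy(K, k, Sigma, Qaa, Qsa, qa, hi)

# Toy usage on random (hypothetical) quantities for a single time step.
rng = np.random.default_rng(0)
ds, da = 3, 2
K, k, Sigma = 0.1 * rng.normal(size=(da, ds)), np.zeros(da), np.eye(da)
A = rng.normal(size=(da, da))
Qaa = -(A @ A.T + np.eye(da))   # negative definite by construction
Qsa, qa = rng.normal(size=(da, ds)), rng.normal(size=da)
states = rng.normal(size=(20, ds))
Kn, kn, Sn = kl_constrained_update(states, K, k, Sigma, Qaa, Qsa, qa)
print("expected KL after update:", expected_kl(states, Kn, kn, Sn, K, k, Sigma))

Because the Q-model is quadratic in the action and the policy is Gaussian, the re-weighted policy stays Gaussian in closed form; no linearization of the system dynamics is needed, which is the point the abstract makes.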
Pages: 25