Efficient Robot Skills Learning with Weighted Near-Optimal Experiences Policy Optimization

Cited by: 3
Authors
Hou, Liwei [1 ]
Wang, Hengsheng [1 ,2 ]
Zou, Haoran [1 ]
Wang, Qun [1 ,3 ]
Affiliations
[1] Cent South Univ, Coll Mech & Elect Engn, Changsha 410083, Peoples R China
[2] Cent South Univ, State Key Lab High Performance Complex Mfg, Changsha 410083, Peoples R China
[3] Hunan Univ, Modern Engn Training Ctr, Changsha 410082, Peoples R China
Source
APPLIED SCIENCES-BASEL | 2021, Vol. 11, No. 3
Keywords
robot skills learning; policy learning; policy gradient; experience; data efficiency; LOCOMOTION;
DOI
10.3390/app11031131
Chinese Library Classification
O6 [Chemistry]
Discipline Classification Code
0703
Abstract
Autonomous learning of robot skills is more natural and more practical than hand-engineered skills, analogous to the learning process of human individuals. Policy gradient methods are a class of reinforcement learning techniques with great potential for solving robot skills learning problems. However, policy gradient methods require a large number of online interactions between the robot and the environment to learn a good policy, which lowers the efficiency of the learning process and raises the likelihood of damage to both the robot and the environment. In this paper, we propose a two-phase (imitation phase and practice phase) framework for efficient learning of robot walking skills that attends to both the quality of the learned skill and sample efficiency. Training starts with the first stage, the imitation phase, in which the parameters of the policy network are updated in a supervised learning manner. The training set for this phase consists of experienced trajectories output by an iterative linear Gaussian controller; this paper refers to these trajectories as near-optimal experiences. In the second stage, the practice phase, the experiences for policy network learning are collected directly from online interactions, and the policy network parameters are updated with model-free reinforcement learning. Experiences from both stages are stored in a weighted replay buffer and ordered according to the experience scoring algorithm proposed in this paper. The proposed framework is tested on a biped robot walking task in a MATLAB simulation environment. The results show that the sample efficiency of the proposed framework is much higher than that of ordinary policy gradient algorithms; the proposed algorithm achieved the highest cumulative reward, and the robot autonomously learned better walking skills.
In addition, the weighted replay buffer can be used as a general module in other model-free reinforcement learning algorithms. Our framework provides a new way to combine model-based and model-free reinforcement learning to efficiently update the policy network parameters during robot skills learning.
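The abstract describes a weighted replay buffer in which experiences from both phases are ordered by an experience scoring algorithm, but it does not specify the scoring rule itself. The following is a minimal illustrative sketch, assuming (as a stand-in for the paper's scoring algorithm) that each trajectory is scored by its cumulative reward and that higher-scored experiences are sampled more often; the class and method names are hypothetical, not from the paper.

```python
import random

class WeightedReplayBuffer:
    """Sketch of a score-ordered replay buffer.

    Trajectories are stored as lists of (state, action, reward) tuples.
    The scoring rule below (cumulative reward) is an assumption standing
    in for the paper's experience scoring algorithm.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self.buffer = []  # list of (score, trajectory) pairs, best first

    def score(self, trajectory):
        # Assumed scoring rule: cumulative reward of the trajectory.
        return sum(reward for (_, _, reward) in trajectory)

    def add(self, trajectory):
        self.buffer.append((self.score(trajectory), trajectory))
        # Keep experiences arranged in order of score, highest first.
        self.buffer.sort(key=lambda pair: pair[0], reverse=True)
        # Evict the lowest-scored experiences when over capacity.
        del self.buffer[self.capacity:]

    def sample(self, batch_size):
        # Sample with probability proportional to score,
        # shifted so all weights are positive.
        scores = [s for s, _ in self.buffer]
        low = min(scores)
        weights = [s - low + 1e-6 for s in scores]
        picked = random.choices(self.buffer, weights=weights, k=batch_size)
        return [trajectory for _, trajectory in picked]
```

Because experiences from the imitation phase (near-optimal controller rollouts) and the practice phase (online rollouts) go through the same scoring and eviction rule, low-quality online experiences are naturally crowded out, which is one plausible reading of why the buffer can serve as a general module for other model-free algorithms.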
Pages: 1 / 20
Page count: 18
Related Papers
50 records in total
  • [1] Advanced Policy Learning Near-Optimal Regulation
    Ding Wang
    Xiangnan Zhong
    IEEE/CAA Journal of Automatica Sinica, 2019, 6 (03) : 743 - 749
  • [3] Near-optimal Policy Optimization Algorithms for Learning Adversarial Linear Mixture MDPs
    He, Jiafan
    Zhou, Dongruo
    Gu, Quanquan
    INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND STATISTICS, VOL 151, 2022, 151
  • [4] Near-Optimal Weighted Matrix Completion
    Lopez, Oscar
    JOURNAL OF MACHINE LEARNING RESEARCH, 2023, 24
  • [5] Efficient, near-optimal control allocation
    Durham, WC
    JOURNAL OF GUIDANCE CONTROL AND DYNAMICS, 1999, 22 (02) : 369 - 372
  • [7] Efficient and Near-Optimal Smoothed Online Learning for Generalized Linear Functions
    Block, Adam
    Simchowitz, Max
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35 (NEURIPS 2022), 2022,
  • [8] Learning Near-optimal Decision Rules for Energy Efficient Building Control
    Domahidi, Alexander
    Ullmann, Fabian
    Morari, Manfred
    Jones, Colin N.
    2012 IEEE 51ST ANNUAL CONFERENCE ON DECISION AND CONTROL (CDC), 2012, : 7571 - 7576
  • [9] Safe Learning for Near-Optimal Scheduling
    Busatto-Gaston, Damien
    Chakraborty, Debraj
    Guha, Shibashis
    Perez, Guillermo A.
    Raskin, Jean-Francois
    QUANTITATIVE EVALUATION OF SYSTEMS (QEST 2021), 2021, 12846 : 235 - 254
  • [10] Near-Optimal Collaborative Learning in Bandits
    Reda, Clemence
    Vakili, Sattar
    Kaufmann, Emilie
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35, NEURIPS 2022, 2022,