Batch Reinforcement Learning With a Nonparametric Off-Policy Policy Gradient

Cited by: 2
Authors
Tosatto, Samuele [1]
Carvalho, Joao [2]
Peters, Jan [2]
Affiliations
[1] Univ Alberta, Dept Comp Sci, Edmonton, AB T6G 2R3, Canada
[2] Tech Univ Darmstadt, FG Intelligent Autonomous Syst, D-64289 Darmstadt, Germany
Keywords
Mathematical model; Estimation; Kernel; Reinforcement learning; Monte Carlo methods; Task analysis; Closed-form solutions; Policy gradient; Nonparametric estimation; Iteration
DOI
10.1109/TPAMI.2021.3088063
CLC Number
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Off-policy reinforcement learning (RL) holds the promise of better data efficiency, as it allows sample reuse and potentially enables safe interaction with the environment. Current off-policy policy gradient methods suffer from either high bias or high variance, often delivering unreliable estimates. The price of this inefficiency becomes evident in real-world scenarios such as interaction-driven robot learning, where the success of RL has been rather limited and a very high sample cost hinders straightforward application. In this paper, we propose a nonparametric Bellman equation that can be solved in closed form. The solution is differentiable w.r.t. the policy parameters and yields an estimate of the policy gradient. In this way, we avoid both the high variance of importance-sampling approaches and the high bias of semi-gradient methods. We empirically analyze the quality of our gradient estimate against state-of-the-art methods and show that it outperforms the baselines in terms of sample efficiency on classical control tasks.
Pages: 5996-6010
Number of pages: 15
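
The abstract's central idea, a Bellman equation built by kernel smoothing over a fixed batch of transitions, solved in closed form as a linear system and differentiated through that solve, can be illustrated with a small sketch. The code below is a hypothetical toy, not the authors' implementation: the dataset, the Gaussian kernels and their bandwidth, the linear policy, and the uniform averaging of q-values are all assumptions made for illustration.

    import torch

    # Toy batch of transitions (s, a, r, s') on a 1-D state/action space.
    # The data-generating process here is an illustrative assumption.
    torch.manual_seed(0)
    N = 64
    S  = torch.randn(N, 1)                        # states visited by a behavior policy
    A  = torch.randn(N, 1)                        # actions taken
    R  = -(S + A).pow(2).squeeze(-1)              # rewards (toy quadratic cost)
    S2 = 0.9 * S + A + 0.1 * torch.randn(N, 1)    # next states

    def rbf(x, y, bandwidth=0.5):
        # Gaussian kernel matrix k(x_i, y_j); the bandwidth is an assumption.
        d2 = (x[:, None, :] - y[None, :, :]).pow(2).sum(-1)
        return torch.exp(-0.5 * d2 / bandwidth**2)

    def policy(s, theta):
        # Deterministic linear policy a = theta * s, chosen only for simplicity.
        return theta * s

    def batch_value(theta, gamma=0.95):
        # Kernel-smoothed "transition matrix": how strongly each sample's next
        # state, paired with the action the current policy would take there,
        # resembles each (state, action) pair in the batch.
        P = rbf(S2, S) * rbf(policy(S2, theta), A)
        P = P / P.sum(-1, keepdim=True)           # row-normalize
        # Closed-form Bellman solution q = (I - gamma * P)^{-1} r; since P is
        # row-stochastic and gamma < 1, the system is invertible.
        q = torch.linalg.solve(torch.eye(N) - gamma * P, R.unsqueeze(-1)).squeeze(-1)
        return q.mean()                           # crude surrogate for the objective

    theta = torch.tensor(0.5, requires_grad=True)
    J = batch_value(theta)
    J.backward()                                  # gradient flows through the solve
    print(f"J = {J.item():.3f}, dJ/dtheta = {theta.grad.item():.3f}")

Because the smoothed transition matrix depends smoothly on the policy parameter, the gradient is obtained by ordinary automatic differentiation through the linear solve, with no per-sample importance weights; this is the mechanism by which such estimators sidestep the variance of importance sampling while avoiding the bias of semi-gradient updates.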