Batch Reinforcement Learning With a Nonparametric Off-Policy Policy Gradient

Cited by: 2

Authors:

Tosatto, Samuele ^{[1]}

Carvalho, Joao ^{[2]}

Peters, Jan ^{[2]}

Affiliations:

[1] Univ Alberta, Dept Comp Sci, Edmonton, AB T6G 2R3, Canada

[2] Tech Univ Darmstadt, FG Intelligent Autonomous Syst, D-64289 Darmstadt, Germany

Source:

IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE | 2022 / Vol. 44 / No. 10

Keywords:

Mathematical model; Estimation; Kernel; Reinforcement learning; Monte Carlo methods; Task analysis; Closed-form solutions; policy gradient; nonparametric estimation; ITERATION;

DOI:

10.1109/TPAMI.2021.3088063

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Subject classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Off-policy reinforcement learning (RL) holds the promise of better data efficiency as it allows sample reuse and potentially enables safe interaction with the environment. Current off-policy policy gradient methods suffer from either high bias or high variance, often delivering unreliable estimates. The price of inefficiency becomes evident in real-world scenarios such as interaction-driven robot learning, where the success of RL has been rather limited, and a very high sample cost hinders straightforward application. In this paper, we propose a nonparametric Bellman equation, which can be solved in closed form. The solution is differentiable w.r.t. the policy parameters and gives access to an estimate of the policy gradient. In this way, we avoid both the high variance of importance sampling approaches and the high bias of semi-gradient methods. We empirically analyze the quality of our gradient estimate against state-of-the-art methods, and show that it outperforms the baselines in terms of sample efficiency on classical control tasks.
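
The idea of a Bellman equation that is solved in closed form and then differentiated with respect to the policy parameters can be illustrated with a small sketch. This is not the estimator from the paper: the transition matrices `P`, rewards `R`, start distribution `MU0`, and the one-parameter sigmoid policy are all made-up stand-ins for whatever a nonparametric (kernel-based) model fitted to a batch of samples would provide. Under those assumptions, the value vector solves the linear system (I - gamma * P_pi) v = r_pi, and differentiating that system gives a second linear system for the gradient.

```python
import math

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

# Hypothetical model over 3 sample states and 2 actions (illustrative numbers only).
GAMMA = 0.9
P = [  # P[a][i][j]: probability of moving from state i to state j under action a
    [[0.9, 0.1, 0.0], [0.5, 0.5, 0.0], [0.0, 0.5, 0.5]],
    [[0.1, 0.9, 0.0], [0.0, 0.1, 0.9], [0.0, 0.0, 1.0]],
]
R = [[0.0, 0.0, 1.0], [-0.1, -0.1, 0.9]]  # R[a][i]: reward in state i under action a
MU0 = [1.0, 0.0, 0.0]                     # assumed start-state distribution

def value_and_gradient(theta):
    """Return (J, dJ/dtheta) for a policy choosing action 1 with prob sigmoid(theta)."""
    n = len(MU0)
    p = 1.0 / (1.0 + math.exp(-theta))
    # Policy-averaged transition matrix and reward vector.
    P_pi = [[(1 - p) * P[0][i][j] + p * P[1][i][j] for j in range(n)] for i in range(n)]
    r_pi = [(1 - p) * R[0][i] + p * R[1][i] for i in range(n)]
    # Closed-form solution of the Bellman equation: (I - gamma * P_pi) v = r_pi.
    A = [[(1.0 if i == j else 0.0) - GAMMA * P_pi[i][j] for j in range(n)]
         for i in range(n)]
    v = solve(A, r_pi)
    # Differentiate the linear system: A dv = dr_pi + gamma * dP_pi v.
    dp = p * (1 - p)  # derivative of the sigmoid
    rhs = [dp * ((R[1][i] - R[0][i])
                 + GAMMA * sum((P[1][i][j] - P[0][i][j]) * v[j] for j in range(n)))
           for i in range(n)]
    dv = solve(A, rhs)
    J = sum(m * x for m, x in zip(MU0, v))
    dJ = sum(m * x for m, x in zip(MU0, dv))
    return J, dJ

J, dJ = value_and_gradient(0.0)
print(J, dJ)
```

Because the gradient comes from differentiating the solved system rather than from reweighting trajectories, the sketch needs no importance weights. A central finite difference on `value_and_gradient(theta)[0]` is a convenient way to check the analytic gradient.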


Pages: 5996-6010

Page count: 15

