Robust On-Policy Sampling for Data-Efficient Policy Evaluation in Reinforcement Learning

Cited by: 0
Authors
Zhong, Rujie [1 ]
Zhang, Duohan [2 ]
Schafer, Lukas [1 ]
Albrecht, Stefano V. [1 ]
Hanna, Josiah P. [3 ]
Affiliations
[1] Univ Edinburgh, Sch Informat, Edinburgh, Midlothian, Scotland
[2] Univ Wisconsin Madison, Dept Stat, Madison, WI 53706 USA
[3] Univ Wisconsin Madison, Dept Comp Sci, Madison, WI 53706 USA
Keywords
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Theory of artificial intelligence];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Reinforcement learning (RL) algorithms are often categorized as either on-policy or off-policy depending on whether they use data from a target policy of interest or from a different behavior policy. In this paper, we study a subtle distinction between on-policy data and on-policy sampling in the context of the RL sub-problem of policy evaluation. We observe that on-policy sampling may fail to match the expected distribution of on-policy data after observing only a finite number of trajectories, and that this failure hinders data-efficient policy evaluation. Toward improved data efficiency, we show how non-i.i.d., off-policy sampling can produce data that more closely matches the expected on-policy data distribution and consequently increases the accuracy of the Monte Carlo estimator for policy evaluation. We introduce a method called Robust On-Policy Sampling and demonstrate theoretically and empirically that it produces data that converges faster to the expected on-policy distribution compared to on-policy sampling. Empirically, we show that this faster convergence leads to lower mean squared error policy value estimates.
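The abstract's key observation is that i.i.d. draws from a target policy can leave the empirical action distribution far from the expected on-policy distribution after finitely many samples, whereas an adaptive, non-i.i.d. sampler can correct the gap as it goes. The sketch below illustrates this idea in a hypothetical single-state (bandit-like) setting. Note that the paper's actual method adapts the behavior policy via a gradient step on the log-likelihood of the data collected so far; the simpler count-based "deficit" rule here is an illustrative stand-in, and all names are assumptions, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Fixed target policy over 3 actions in a single-state setting (assumption
# for illustration; the paper treats full RL trajectories).
target = np.array([0.5, 0.3, 0.2])
n_actions = len(target)

def tv_distance(counts, target):
    """Total-variation distance between the empirical and target distributions."""
    n = counts.sum()
    if n == 0:
        return 1.0
    return 0.5 * np.abs(counts / n - target).sum()

def collect(sampler, n_steps):
    """Draw n_steps actions, letting the sampler see the running counts."""
    counts = np.zeros(n_actions)
    for _ in range(n_steps):
        a = rng.choice(n_actions, p=sampler(counts))
        counts[a] += 1
    return counts

def on_policy(counts):
    # Ordinary on-policy sampling: i.i.d. draws from the target policy.
    return target

def robust(counts):
    # Count-based correction (simplified stand-in for Robust On-Policy
    # Sampling): favor actions whose empirical frequency lags the target.
    deficit = np.maximum(target * (counts.sum() + 1) - counts, 0.0)
    if deficit.sum() == 0:
        return target
    return deficit / deficit.sum()

n = 200
c_on = collect(on_policy, n)
c_ros = collect(robust, n)
print("on-policy TV distance:", tv_distance(c_on, target))
print("robust    TV distance:", tv_distance(c_ros, target))
```

Because the deficit rule never selects an action already at or above its target share, the empirical distribution under the adaptive sampler stays within O(1/n) of the target, while i.i.d. on-policy sampling deviates at the usual O(1/sqrt(n)) rate; this mirrors the faster convergence the abstract claims.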
Pages: 13