Robust On-Policy Sampling for Data-Efficient Policy Evaluation in Reinforcement Learning

Cited by: 0
Authors
Zhong, Rujie [1 ]
Zhang, Duohan [2 ]
Schafer, Lukas [1 ]
Albrecht, Stefano V. [1 ]
Hanna, Josiah P. [3 ]
Affiliations
[1] Univ Edinburgh, Sch Informat, Edinburgh, Midlothian, Scotland
[2] Univ Wisconsin Madison, Dept Stat, Madison, WI 53706 USA
[3] Univ Wisconsin Madison, Dept Comp Sci, Madison, WI 53706 USA
Keywords
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Theory of artificial intelligence];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Reinforcement learning (RL) algorithms are often categorized as either on-policy or off-policy depending on whether they use data from a target policy of interest or from a different behavior policy. In this paper, we study a subtle distinction between on-policy data and on-policy sampling in the context of the RL sub-problem of policy evaluation. We observe that on-policy sampling may fail to match the expected distribution of on-policy data after observing only a finite number of trajectories, and that this failure hinders data-efficient policy evaluation. Toward improved data efficiency, we show how non-i.i.d., off-policy sampling can produce data that more closely matches the expected on-policy data distribution and consequently increases the accuracy of the Monte Carlo estimator for policy evaluation. We introduce a method called Robust On-Policy Sampling and demonstrate theoretically and empirically that it produces data that converges faster to the expected on-policy distribution compared to on-policy sampling. Empirically, we show that this faster convergence leads to lower mean squared error policy value estimates.
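The abstract's key observation is that i.i.d. draws from a target policy can leave the empirical action distribution far from the expected on-policy distribution after finitely many samples, whereas an adaptive, non-i.i.d. sampler can correct the gap as it goes. The sketch below illustrates this idea in a hypothetical single-state (bandit-like) setting. Note that the paper's actual method adapts the behavior policy via a gradient step on the log-likelihood of the data collected so far; the simpler count-based "deficit" rule here is an illustrative stand-in, and all names are assumptions, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Fixed target policy over 3 actions in a single-state setting (assumption
# for illustration; the paper treats full RL trajectories).
target = np.array([0.5, 0.3, 0.2])
n_actions = len(target)

def tv_distance(counts, target):
    """Total-variation distance between the empirical and target distributions."""
    n = counts.sum()
    if n == 0:
        return 1.0
    return 0.5 * np.abs(counts / n - target).sum()

def collect(sampler, n_steps):
    """Draw n_steps actions, letting the sampler see the running counts."""
    counts = np.zeros(n_actions)
    for _ in range(n_steps):
        a = rng.choice(n_actions, p=sampler(counts))
        counts[a] += 1
    return counts

def on_policy(counts):
    # Ordinary on-policy sampling: i.i.d. draws from the target policy.
    return target

def robust(counts):
    # Count-based correction (simplified stand-in for Robust On-Policy
    # Sampling): favor actions whose empirical frequency lags the target.
    deficit = np.maximum(target * (counts.sum() + 1) - counts, 0.0)
    if deficit.sum() == 0:
        return target
    return deficit / deficit.sum()

n = 200
c_on = collect(on_policy, n)
c_ros = collect(robust, n)
print("on-policy TV distance:", tv_distance(c_on, target))
print("robust    TV distance:", tv_distance(c_ros, target))
```

Because the deficit rule never selects an action already at or above its target share, the empirical distribution under the adaptive sampler stays within O(1/n) of the target, while i.i.d. on-policy sampling deviates at the usual O(1/sqrt(n)) rate; this mirrors the faster convergence the abstract claims.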
Pages: 13