Towards Optimal Off-Policy Evaluation for Reinforcement Learning with Marginalized Importance Sampling

Times cited: 0
Authors
Xie, Tengyang [1 ]
Ma, Yifei [2 ]
Wang, Yu-Xiang [3 ]
Affiliations
[1] UIUC, Dept Comp Sci, Urbana, IL 61801 USA
[2] Amazon.com Services Inc, AWS AI Labs, East Palo Alto, CA 94303 USA
[3] UC Santa Barbara, Dept Comp Sci, Santa Barbara, CA 93106 USA
Keywords
(none listed)
DOI
Not available
Chinese Library Classification (CLC)
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Motivated by the many real-world applications of reinforcement learning (RL) that require safe policy iterations, we consider the problem of off-policy evaluation (OPE) - the problem of evaluating a new policy using historical data obtained from different behavior policies - under the model of nonstationary episodic Markov Decision Processes (MDPs) with a long horizon and a large action space. Existing importance sampling (IS) methods often suffer from a large variance that depends exponentially on the RL horizon $H$. To solve this problem, we consider a marginalized importance sampling (MIS) estimator that recursively estimates the state marginal distribution of the target policy at every step. MIS achieves a mean-squared error of $\frac{1}{n}\sum_{t=1}^{H}\mathbb{E}_{\mu}\!\left[\frac{d_t^{\pi}(s_t)^2}{d_t^{\mu}(s_t)^2}\,\mathrm{Var}_{\mu}\!\left[\frac{\pi_t(a_t\mid s_t)}{\mu_t(a_t\mid s_t)}\bigl(V_{t+1}^{\pi}(s_{t+1})+r_t\bigr)\,\Big|\,s_t\right]\right] + \tilde{O}(n^{-1.5})$, where $\mu$ and $\pi$ are the logging and target policies, $d_t^{\mu}(s_t)$ and $d_t^{\pi}(s_t)$ are the marginal state distributions at the $t$-th step, $H$ is the horizon, $n$ is the sample size, and $V_{t+1}^{\pi}$ is the value function of the MDP under $\pi$. The result matches the Cramér-Rao lower bound of Jiang and Li [2016] up to a multiplicative factor of $H$. To the best of our knowledge, this is the first OPE estimation error bound with a polynomial dependence on $H$. Besides theory, we show the empirical superiority of our method in time-varying, partially observable, and long-horizon RL environments.
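The abstract describes the MIS estimator only verbally. As a rough, non-authoritative illustration of the idea (not the authors' implementation), the sketch below shows a minimal tabular version: the target policy's state marginal is updated recursively through an importance-weighted empirical transition model, and the value estimate accumulates importance-weighted per-state rewards. The function name mis_estimate, the array layout, and the use of the empirical step-1 state frequencies as the initial distribution are assumptions made for this sketch.

```python
import numpy as np

def mis_estimate(states, actions, rewards, pi, mu):
    """Rough tabular MIS off-policy value estimate (illustrative sketch only).

    states  : (n, H+1) int array; states[i, t] is the state s_t of trajectory i.
    actions : (n, H)   int array of behavior-policy actions a_t.
    rewards : (n, H)   float array of rewards r_t.
    pi, mu  : (H, S, A) arrays of target / behavior action probabilities.
    Returns an estimate of the H-step value of the target policy pi.
    """
    n, H = actions.shape
    S = pi.shape[1]

    # Empirical initial state distribution (shared by pi and mu, since the
    # first state is drawn before any action is taken).
    d_pi = np.bincount(states[:, 0], minlength=S) / n

    value = 0.0
    for t in range(H):
        s_t, a_t, r_t = states[:, t], actions[:, t], rewards[:, t]
        rho = pi[t, s_t, a_t] / mu[t, s_t, a_t]      # per-sample action ratio

        counts = np.bincount(s_t, minlength=S).astype(float)
        safe = np.maximum(counts, 1.0)               # avoid division by zero

        # Importance-weighted per-state mean reward under pi at step t.
        r_pi = np.bincount(s_t, weights=rho * r_t, minlength=S) / safe
        value += float(d_pi @ r_pi)

        # Recursive update of the target policy's state marginal:
        # P_hat[s, s'] is an importance-weighted empirical transition under pi.
        P_hat = np.zeros((S, S))
        np.add.at(P_hat, (s_t, states[:, t + 1]), rho)
        P_hat /= safe[:, None]
        d_pi = d_pi @ P_hat

    return value
```

States never visited at some step simply contribute zero in this sketch; the paper treats such finite-sample issues, and the resulting error analysis, with far more care.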
Pages: 11
Related papers
50 in total
  • [1] Marginalized Operators for Off-policy Reinforcement Learning
    Tang, Yunhao
    Rowland, Mark
    Munos, Remi
    Valko, Michal
    [J]. INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND STATISTICS, VOL 151, 2022, 151 : 655 - 679
  • [2] Subgaussian and Differentiable Importance Sampling for Off-Policy Evaluation and Learning
    Metelli, Alberto Maria
    Russo, Alessio
    Restelli, Marcello
    [J]. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 34 (NEURIPS 2021), 2021,
  • [3] Conditional Importance Sampling for Off-Policy Learning
    Rowland, Mark
    Harutyunyan, Anna
    van Hasselt, Hado
    Borsa, Diana
    Schaul, Tom
    Munos, Remi
    Dabney, Will
    [J]. INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND STATISTICS, VOL 108, 2020, 108 : 45 - 54
  • [4] Adaptive importance sampling for value function approximation in off-policy reinforcement learning
    Hachiya, Hirotaka
    Akiyama, Takayuki
    Sugiyama, Masashi
    Peters, Jan
    [J]. NEURAL NETWORKS, 2009, 22 (10) : 1399 - 1410
  • [5] Mixed experience sampling for off-policy reinforcement learning
    Yu, Jiayu
    Li, Jingyao
    Lu, Shuai
    Han, Shuai
    [J]. EXPERT SYSTEMS WITH APPLICATIONS, 2024, 251
  • [6] A perspective on off-policy evaluation in reinforcement learning
    Li, Lihong
    [J]. FRONTIERS OF COMPUTER SCIENCE, 2019, 13 (05) : 911 - 912
  • [7] Reliable Off-Policy Evaluation for Reinforcement Learning
    Wang, Jie
    Gao, Rui
    Zha, Hongyuan
    [J]. OPERATIONS RESEARCH, 2024, 72 (02) : 699 - 716
  • [8] Research on Off-Policy Evaluation in Reinforcement Learning: A Survey
    Wang, Shuo-Ru
    Niu, Wen-Jia
    Tong, En-Dong
    Chen, Tong
    Li, He
    Tian, Yun-Zhe
    Liu, Ji-Qiang
    Han, Zhen
    Li, Yi-Dong
    [J]. Jisuanji Xuebao/Chinese Journal of Computers, 2022, 45 (09): : 1926 - 1945
  • [9] Marginalized Importance Sampling for Off-Environment Policy Evaluation
    Katdare, Pulkit
    Jiang, Nan
    Driggs-Campbell, Katherine
    [J]. CONFERENCE ON ROBOT LEARNING, VOL 229, 2023, 229