Minimax-Optimal Off-Policy Evaluation with Linear Function Approximation

Cited by: 0
Authors
Duan, Yaqi [1 ]
Jia, Zeyu [2 ]
Wang, Mengdi [3 ,4 ]
Affiliations
[1] Princeton Univ, Dept Operat Res & Financial Engn, Princeton, NJ USA
[2] Peking Univ, Sch Math, Beijing, Peoples R China
[3] Princeton Univ, Dept Operat Res & Financial Engn, Princeton, NJ USA
[4] DeepMind, London, England
Funding
U.S. National Science Foundation
Keywords
DOI
Not available
Chinese Library Classification (CLC)
TP (Automation Technology; Computer Technology)
Subject Classification Code
0812
Abstract
This paper studies the statistical theory of off-policy policy evaluation with function approximation in batch-data reinforcement learning. We consider a regression-based fitted Q iteration method and show that it is equivalent to a model-based method that estimates a conditional mean embedding of the transition operator. We prove that this method is information-theoretically optimal and achieves nearly minimal estimation error. In particular, by leveraging the contraction property of Markov processes and martingale concentration, we establish a finite-sample, instance-dependent error upper bound and a nearly matching minimax lower bound. The policy evaluation error depends sharply on a restricted χ²-divergence, taken over the function class, between the long-term distribution of the target policy and the distribution of the past data. This restricted χ²-divergence characterizes the statistical limit of off-policy evaluation and is both instance-dependent and function-class-dependent. Further, we provide an easily computable confidence bound for the policy evaluator, which may be useful for optimistic planning and safe policy improvement.
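The abstract describes a regression-based fitted Q iteration (fitted Q evaluation) estimator with linear function approximation. The following is a minimal Python sketch of that style of estimator, written from the abstract's description rather than from the authors' code; the feature map `phi`, the `transitions` dataset, the deterministic target policy `pi`, and the discount/iteration/ridge parameters are all illustrative assumptions, and the paper's χ²-divergence analysis and confidence bound are not reproduced here.

```python
# Minimal sketch (not the authors' implementation) of fitted Q evaluation
# with linear function approximation for off-policy policy evaluation.
import numpy as np

def fitted_q_evaluation(phi, transitions, pi, gamma=0.9, n_iters=100, ridge=1e-6):
    """Estimate Q^pi by repeated least-squares regression on Bellman targets.

    phi(s, a)   -> feature vector of dimension d (assumed given)
    transitions -> list of (s, a, r, s_next) tuples logged by the behavior policy
    pi(s)       -> action chosen by the target policy in state s
                   (a deterministic target policy is assumed for simplicity)
    """
    d = len(phi(*transitions[0][:2]))
    Phi = np.array([phi(s, a) for (s, a, _, _) in transitions])           # n x d
    Phi_next = np.array([phi(s_next, pi(s_next))
                         for (_, _, _, s_next) in transitions])           # n x d
    r = np.array([rew for (_, _, rew, _) in transitions])                 # n

    # Ridge-regularized least-squares projection; each iteration regresses the
    # Bellman target r + gamma * Phi_next @ w onto the features Phi.
    A = Phi.T @ Phi + ridge * np.eye(d)
    w = np.zeros(d)
    for _ in range(n_iters):
        target = r + gamma * (Phi_next @ w)
        w = np.linalg.solve(A, Phi.T @ target)
    return w  # Q_hat(s, a) is approximated by phi(s, a) @ w

# Usage (under the same assumptions): the target policy's value from an initial
# state s0 would be estimated as phi(s0, pi(s0)) @ w.
```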
Pages: 9
Related Papers (50 total)
  • [1] Minimax-Optimal Off-Policy Evaluation with Linear Function Approximation
    Duan, Yaqi
    Jia, Zeyu
    Wang, Mengdi
International Conference on Machine Learning (ICML), Vol. 119, 2020
  • [2] Variance-Aware Off-Policy Evaluation with Linear Function Approximation
    Min, Yifei
    Wang, Tianhao
    Zhou, Dongruo
    Gu, Quanquan
Advances in Neural Information Processing Systems 34 (NeurIPS 2021), 2021
  • [3] Average-Reward Off-Policy Policy Evaluation with Function Approximation
    Zhang, Shangtong
    Wan, Yi
    Sutton, Richard S.
    Whiteson, Shimon
International Conference on Machine Learning (ICML), Vol. 139, 2021
  • [4] The Optimal Approximation Factors in Misspecified Off-Policy Value Function Estimation
    Amortila, Philip
    Jiang, Nan
    Szepesvari, Csaba
International Conference on Machine Learning (ICML), Vol. 202, 2023, pp. 768-790
  • [5] Weighted importance sampling for off-policy learning with linear function approximation
    Mahmood, A. Rupam
    Van Hasselt, Hado
    Sutton, Richard S.
Advances in Neural Information Processing Systems 27 (NIPS 2014), 2014
  • [6] Minimax Value Interval for Off-Policy Evaluation and Policy Optimization
    Jiang, Nan
    Huang, Jiawei
Advances in Neural Information Processing Systems 33 (NeurIPS 2020), 2020
  • [7] On Well-posedness and Minimax Optimal Rates of Nonparametric Q-function Estimation in Off-policy Evaluation
    Chen, Xiaohong
    Qi, Zhengling
International Conference on Machine Learning (ICML), Vol. 162, 2022
  • [8] On the role of overparameterization in off-policy Temporal Difference learning with linear function approximation
    Thomas, Valentin
Advances in Neural Information Processing Systems 35 (NeurIPS 2022), 2022
  • [9] Minimax Off-Policy Evaluation for Multi-Armed Bandits
    Ma, Cong
    Zhu, Banghua
    Jiao, Jiantao
    Wainwright, Martin J.
IEEE Transactions on Information Theory, 2022, 68(8): 5314-5339
  • [10] Generalized gradient emphasis learning for off-policy evaluation and control with function approximation
    Cao, Jiaqing
    Liu, Quan
    Wu, Lan
    Fu, Qiming
    Zhong, Shan
    Neural Computing and Applications, 2023, 35: 23599-23616