Minimax-Optimal Off-Policy Evaluation with Linear Function Approximation

Cited by: 0
Authors
Duan, Yaqi [1 ]
Jia, Zeyu [2 ]
Wang, Mengdi [3 ,4 ]
Affiliations
[1] Princeton Univ, Dept Operat Res & Financial Engn, Princeton, NJ USA
[2] Peking Univ, Sch Math, Beijing, Peoples R China
[3] Princeton Univ, Dept Operat Res & Financial Engn, Princeton, NJ USA
[4] DeepMind, London, England
Funding
U.S. National Science Foundation
Keywords
DOI
Not available
Chinese Library Classification (CLC)
TP (Automation Technology; Computer Technology)
Subject Classification Code
0812
Abstract
This paper studies the statistical theory of off-policy policy evaluation with function approximation in batch-data reinforcement learning. We consider a regression-based fitted Q iteration method and show that it is equivalent to a model-based method that estimates a conditional mean embedding of the transition operator. We prove that this method is information-theoretically optimal and achieves nearly minimal estimation error. In particular, by leveraging the contraction property of Markov processes and martingale concentration, we establish a finite-sample, instance-dependent error upper bound and a nearly matching minimax lower bound. The policy evaluation error depends sharply on a restricted χ²-divergence, taken over the function class, between the long-term distribution of the target policy and the distribution of the past data. This restricted χ²-divergence characterizes the statistical limit of off-policy evaluation and is both instance-dependent and function-class-dependent. Further, we provide an easily computable confidence bound for the policy evaluator, which may be useful for optimistic planning and safe policy improvement.
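The abstract describes a regression-based fitted Q iteration (fitted Q evaluation) estimator with linear function approximation. The following is a minimal Python sketch of that style of estimator, written from the abstract's description rather than from the authors' code; the feature map `phi`, the `transitions` dataset, the deterministic target policy `pi`, and the discount/iteration/ridge parameters are all illustrative assumptions, and the paper's χ²-divergence analysis and confidence bound are not reproduced here.

```python
# Minimal sketch (not the authors' implementation) of fitted Q evaluation
# with linear function approximation for off-policy policy evaluation.
import numpy as np

def fitted_q_evaluation(phi, transitions, pi, gamma=0.9, n_iters=100, ridge=1e-6):
    """Estimate Q^pi by repeated least-squares regression on Bellman targets.

    phi(s, a)   -> feature vector of dimension d (assumed given)
    transitions -> list of (s, a, r, s_next) tuples logged by the behavior policy
    pi(s)       -> action chosen by the target policy in state s
                   (a deterministic target policy is assumed for simplicity)
    """
    d = len(phi(*transitions[0][:2]))
    Phi = np.array([phi(s, a) for (s, a, _, _) in transitions])           # n x d
    Phi_next = np.array([phi(s_next, pi(s_next))
                         for (_, _, _, s_next) in transitions])           # n x d
    r = np.array([rew for (_, _, rew, _) in transitions])                 # n

    # Ridge-regularized least-squares projection; each iteration regresses the
    # Bellman target r + gamma * Phi_next @ w onto the features Phi.
    A = Phi.T @ Phi + ridge * np.eye(d)
    w = np.zeros(d)
    for _ in range(n_iters):
        target = r + gamma * (Phi_next @ w)
        w = np.linalg.solve(A, Phi.T @ target)
    return w  # Q_hat(s, a) is approximated by phi(s, a) @ w

# Usage (under the same assumptions): the target policy's value from an initial
# state s0 would be estimated as phi(s0, pi(s0)) @ w.
```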
Pages: 9
Related Papers (50 total)
  • [1] Minimax-Optimal Off-Policy Evaluation with Linear Function Approximation
    Duan, Yaqi
    Jia, Zeyu
    Wang, Mengdi
International Conference on Machine Learning (ICML), Vol. 119, 2020
  • [2] Variance-Aware Off-Policy Evaluation with Linear Function Approximation
    Min, Yifei
    Wang, Tianhao
    Zhou, Dongruo
    Gu, Quanquan
Advances in Neural Information Processing Systems 34 (NeurIPS 2021), 2021
  • [3] Average-Reward Off-Policy Policy Evaluation with Function Approximation
    Zhang, Shangtong
    Wan, Yi
    Sutton, Richard S.
    Whiteson, Shimon
International Conference on Machine Learning (ICML), Vol. 139, 2021
  • [4] The Optimal Approximation Factors in Misspecified Off-Policy Value Function Estimation
    Amortila, Philip
    Jiang, Nan
    Szepesvari, Csaba
International Conference on Machine Learning (ICML), Vol. 202, 2023, pp. 768-790
  • [5] Weighted importance sampling for off-policy learning with linear function approximation
    Mahmood, A. Rupam
    Van Hasselt, Hado
    Sutton, Richard S.
Advances in Neural Information Processing Systems 27 (NIPS 2014), 2014
  • [6] Minimax Value Interval for Off-Policy Evaluation and Policy Optimization
    Jiang, Nan
    Huang, Jiawei
Advances in Neural Information Processing Systems 33 (NeurIPS 2020), 2020
  • [7] On Well-posedness and Minimax Optimal Rates of Nonparametric Q-function Estimation in Off-policy Evaluation
    Chen, Xiaohong
    Qi, Zhengling
International Conference on Machine Learning (ICML), Vol. 162, 2022
  • [8] On the role of overparameterization in off-policy Temporal Difference learning with linear function approximation
    Thomas, Valentin
Advances in Neural Information Processing Systems 35 (NeurIPS 2022), 2022
  • [9] Minimax Off-Policy Evaluation for Multi-Armed Bandits
    Ma, Cong
    Zhu, Banghua
    Jiao, Jiantao
    Wainwright, Martin J.
IEEE Transactions on Information Theory, 2022, 68(8): 5314-5339
  • [10] Generalized gradient emphasis learning for off-policy evaluation and control with function approximation
    Cao, Jiaqing
    Liu, Quan
    Wu, Lan
    Fu, Qiming
    Zhong, Shan
    Neural Computing and Applications, 2023, 35: 23599-23616