Optimal and Adaptive Off-policy Evaluation in Contextual Bandits

Cited by: 0
Authors
Wang, Yu-Xiang [1 ]
Agarwal, Alekh [2 ]
Dudik, Miroslav [2 ]
Affiliations
[1] Carnegie Mellon Univ, Pittsburgh, PA 15213 USA
[2] Microsoft Res, New York, NY 10011 USA
DOI: not available
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
We study the off-policy evaluation problem (estimating the value of a target policy using data collected by another policy) under the contextual bandit model. We consider the general (agnostic) setting without access to a consistent model of rewards and establish a minimax lower bound on the mean squared error (MSE). The bound is matched up to constants by the inverse propensity scoring (IPS) and doubly robust (DR) estimators. This highlights the difficulty of the agnostic contextual setting, in contrast with multi-armed bandits and contextual bandits with access to a consistent reward model, where IPS is suboptimal. We then propose the SWITCH estimator, which can use an existing reward model (not necessarily consistent) to achieve a better bias-variance tradeoff than IPS and DR. We prove an upper bound on its MSE and demonstrate its benefits empirically on a diverse collection of data sets, often outperforming prior work by orders of magnitude.
Pages: 9
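To make the abstract's comparison concrete, below is a minimal sketch (in Python/NumPy, not the authors' code) of the three estimators it names: IPS, DR, and SWITCH. All input names are hypothetical: `actions` and `rewards` hold the logged data, `mu` and `pi` are (n, K) arrays of logging- and target-policy action probabilities per context, and `r_hat` holds the predictions of a (possibly inconsistent) reward model. The SWITCH threshold `tau` is taken as a given constant here; the paper chooses it adaptively by minimizing an MSE bound.

```python
import numpy as np

def ips(actions, rewards, mu, pi):
    """Inverse propensity scoring: reweight logged rewards by pi/mu."""
    idx = np.arange(len(rewards))
    w = pi[idx, actions] / mu[idx, actions]      # importance weights
    return np.mean(w * rewards)

def dr(actions, rewards, mu, pi, r_hat):
    """Doubly robust: model baseline plus an IPS-weighted model residual."""
    idx = np.arange(len(rewards))
    w = pi[idx, actions] / mu[idx, actions]
    baseline = np.sum(pi * r_hat, axis=1)        # E_pi[r_hat] per context
    residual = rewards - r_hat[idx, actions]
    return np.mean(baseline + w * residual)

def switch_estimator(actions, rewards, mu, pi, r_hat, tau):
    """SWITCH: IPS-style term where the importance weight is <= tau,
    reward-model term for the remaining large-weight actions."""
    idx = np.arange(len(rewards))
    rho = pi / mu                                # (n, K) per-action weights
    ips_term = np.where(rho[idx, actions] <= tau,
                        rho[idx, actions] * rewards, 0.0)
    model_term = np.sum(pi * r_hat * (rho > tau), axis=1)
    return np.mean(ips_term + model_term)

# Synthetic sanity check: random policies and a misspecified reward model.
rng = np.random.default_rng(0)
n, K = 20_000, 5
mu = rng.dirichlet(np.ones(K), size=n)           # logging policy
pi = rng.dirichlet(np.ones(K), size=n)           # target policy
actions = np.array([rng.choice(K, p=p) for p in mu])
true_r = rng.uniform(size=(n, K))                # per-context mean rewards
rewards = true_r[np.arange(n), actions] + 0.1 * rng.standard_normal(n)
r_hat = np.clip(true_r + 0.3 * rng.standard_normal((n, K)), 0, 1)
print("truth :", np.mean(np.sum(pi * true_r, axis=1)))
print("IPS   :", ips(actions, rewards, mu, pi))
print("DR    :", dr(actions, rewards, mu, pi, r_hat))
print("SWITCH:", switch_estimator(actions, rewards, mu, pi, r_hat, tau=5.0))
```

The design point the sketch illustrates: SWITCH interpolates between DR-style importance weighting (used only where the weight rho is small, so variance stays controlled) and the plain reward model (used where the weight would be large), which is where its bias-variance advantage over IPS and DR comes from.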