Optimal and Adaptive Off-policy Evaluation in Contextual Bandits

Cited by: 0

Authors:

Wang, Yu-Xiang [1]

Agarwal, Alekh [2]

Dudik, Miroslav [2]

Affiliations:

[1] Carnegie Mellon Univ, Pittsburgh, PA 15213 USA

[2] Microsoft Res, New York, NY 10011 USA

Source:

INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 70 | 2017 / Volume 70

Keywords:

DOI:

Not available

Chinese Library Classification:

TP18 [Artificial intelligence theory];

Subject classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

We study the off-policy evaluation problem, that is, estimating the value of a target policy using data collected by another policy, under the contextual bandit model. We consider the general (agnostic) setting without access to a consistent model of rewards and establish a minimax lower bound on the mean squared error (MSE). The bound is matched up to constants by the inverse propensity scoring (IPS) and doubly robust (DR) estimators. This highlights the difficulty of the agnostic contextual setting, in contrast with multi-armed bandits and contextual bandits with access to a consistent reward model, where IPS is suboptimal. We then propose the SWITCH estimator, which can use an existing reward model (not necessarily consistent) to achieve a better bias-variance tradeoff than IPS and DR. We prove an upper bound on its MSE and demonstrate its benefits empirically on a diverse collection of data sets, often outperforming prior work by orders of magnitude.
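The abstract names three estimators but gives no formulas, so the following is only a minimal sketch of how they are commonly written for a finite action set. It assumes logged data of the form (context, action, reward), known and strictly positive logging probabilities `mu`, target probabilities `pi`, and a reward model `r_hat`. The threshold rule in `switch_estimate` (importance weighting where the weight is at most `tau`, the reward model elsewhere) is an assumption about how SWITCH combines the two, and all function and parameter names are hypothetical.

```python
import numpy as np


def _logged_weights(pi, mu, actions):
    """Importance weights pi(a_i | x_i) / mu(a_i | x_i) for the logged actions."""
    idx = np.arange(len(actions))
    return pi[idx, actions] / mu[idx, actions]


def ips_estimate(pi, mu, actions, rewards):
    """Inverse propensity scoring: average of importance-weighted rewards."""
    rho = _logged_weights(pi, mu, actions)
    return float(np.mean(rho * rewards))


def dr_estimate(pi, mu, actions, rewards, r_hat):
    """Doubly robust: model-based value plus a weighted residual correction."""
    idx = np.arange(len(actions))
    rho = _logged_weights(pi, mu, actions)
    model_value = np.sum(pi * r_hat, axis=1)  # sum_a pi(a|x) * r_hat(x, a)
    correction = rho * (rewards - r_hat[idx, actions])
    return float(np.mean(model_value + correction))


def switch_estimate(pi, mu, actions, rewards, r_hat, tau):
    """Assumed SWITCH form: IPS where the weight is <= tau, reward model otherwise."""
    rho_all = pi / mu  # weights for every (context, action) pair; needs mu > 0
    rho = _logged_weights(pi, mu, actions)
    ips_part = np.where(rho <= tau, rho * rewards, 0.0)
    model_part = np.sum(np.where(rho_all > tau, pi * r_hat, 0.0), axis=1)
    return float(np.mean(ips_part + model_part))
```

In this sketch, all arguments are NumPy arrays: `pi`, `mu`, and `r_hat` have shape (n, K), while `actions` (integer indices) and `rewards` have shape (n,). Setting `tau = np.inf` reduces `switch_estimate` to `ips_estimate`, and `tau = 0` reduces it to the purely model-based value, so `tau` moves between the low-bias and low-variance ends of the tradeoff the abstract describes.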


Pages: 9

