Optimal and Adaptive Off-policy Evaluation in Contextual Bandits

Cited by: 0

Authors:

Wang, Yu-Xiang [1]

Agarwal, Alekh [2]

Dudik, Miroslav [2]

Affiliations:

[1] Carnegie Mellon Univ, Pittsburgh, PA 15213 USA

[2] Microsoft Res, New York, NY 10011 USA

Source:

INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 70 | 2017 / Vol. 70

Keywords:

DOI:

Not available

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification code:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

We study the off-policy evaluation problem (estimating the value of a target policy using data collected by another policy) under the contextual bandit model. We consider the general (agnostic) setting without access to a consistent model of rewards and establish a minimax lower bound on the mean squared error (MSE). The bound is matched up to constants by the inverse propensity scoring (IPS) and doubly robust (DR) estimators. This highlights the difficulty of the agnostic contextual setting, in contrast with multi-armed bandits and contextual bandits with access to a consistent reward model, where IPS is suboptimal. We then propose the SWITCH estimator, which can use an existing reward model (not necessarily consistent) to achieve a better bias-variance tradeoff than IPS and DR. We prove an upper bound on its MSE and demonstrate its benefits empirically on a diverse collection of data sets, often outperforming prior work by orders of magnitude.
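
The abstract names the IPS, DR and SWITCH estimators without defining them. The following is a minimal Python sketch, not the authors' code. It uses the textbook forms of IPS and DR, and a thresholded form of SWITCH as we understand it: importance weighting where the weight is at most a threshold, and the reward model elsewhere. All names (`ips`, `dr`, `switch`, `behavior`, `target`, `model`, `tau`) and the toy data are hypothetical. The sketch assumes a finite action set and a logging policy that gives every action positive probability.

```python
from typing import Callable, List, Tuple

Policy = Callable[[int, int], float]  # (context, action) -> probability
Model = Callable[[int, int], float]   # (context, action) -> predicted reward
Log = List[Tuple[int, int, float]]    # logged (context, action, reward)


def ips(log: Log, target: Policy, behavior: Policy) -> float:
    """Inverse propensity scoring: mean of importance-weighted rewards."""
    return sum(target(x, a) / behavior(x, a) * r for x, a, r in log) / len(log)


def dr(log: Log, target: Policy, behavior: Policy,
       model: Model, n_actions: int) -> float:
    """Doubly robust: model-based value plus a weighted residual correction."""
    total = 0.0
    for x, a, r in log:
        baseline = sum(target(x, b) * model(x, b) for b in range(n_actions))
        total += baseline + target(x, a) / behavior(x, a) * (r - model(x, a))
    return total / len(log)


def switch(log: Log, target: Policy, behavior: Policy,
           model: Model, n_actions: int, tau: float) -> float:
    """Thresholded estimator: IPS where the weight is <= tau, model elsewhere."""
    total = 0.0
    for x, a, r in log:
        w = target(x, a) / behavior(x, a)
        if w <= tau:
            total += w * r  # small weight: trust the logged reward
        # Large-weight actions: fall back on the (possibly biased) reward model.
        total += sum(target(x, b) * model(x, b) for b in range(n_actions)
                     if target(x, b) / behavior(x, b) > tau)
    return total / len(log)


# Toy setup (hypothetical): two contexts, two actions, uniform logging policy.
def behavior(x: int, a: int) -> float:
    return 0.5


def target(x: int, a: int) -> float:
    return 0.9 if a == x else 0.1


def model(x: int, a: int) -> float:
    return 0.5  # deliberately crude reward model


log = [(0, 0, 1.0), (0, 1, 0.0), (1, 1, 1.0), (1, 0, 0.0)]
v_ips = ips(log, target, behavior)
v_dr = dr(log, target, behavior, model, 2)
v_switch = switch(log, target, behavior, model, 2, tau=1.0)
```

In this sketch, setting `tau` to infinity recovers IPS and setting it to 0 recovers the purely model-based estimate, so `tau` acts as the dial between variance (from large weights) and bias (from the reward model).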


Pages: 9

50 items in total

[1] Optimal Baseline Corrections for Off-Policy Contextual Bandits
Gupta, Shashank
Jeunen, Olivier
Oosterhuis, Harrie
de Rijke, Maarten
PROCEEDINGS OF THE EIGHTEENTH ACM CONFERENCE ON RECOMMENDER SYSTEMS, RECSYS 2024, 2024, : 722 - 732
[2] Off-Policy Evaluation via Adaptive Weighting with Data from Contextual Bandits
Zhan, Ruohan
Hadad, Vitor
Hirshberg, David A.
Athey, Susan
KDD '21: PROCEEDINGS OF THE 27TH ACM SIGKDD CONFERENCE ON KNOWLEDGE DISCOVERY & DATA MINING, 2021, : 2125 - 2135
[3] Effective Off-Policy Evaluation and Learning in Contextual Combinatorial Bandits
Shimizu, Tatsuhiro
Tanaka, Koichi
Kishimoto, Ren
Kiyohara, Haruka
Nomura, Masahiro
Saito, Yuta
PROCEEDINGS OF THE EIGHTEENTH ACM CONFERENCE ON RECOMMENDER SYSTEMS, RECSYS 2024, 2024, : 733 - 741
[4] Marginal Density Ratio for Off-Policy Evaluation in Contextual Bandits
Taufiq, Muhammad Faaiz
Doucet, Arnaud
Cornish, Rob
Ton, Jean-Francois
ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023
[5] Off-Policy Risk Assessment in Contextual Bandits
Huang, Audrey
Liu Leqi
Lipton, Zachary C.
Azizzadenesheli, Kamyar
ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 34 (NEURIPS 2021), 2021, 34
[6] Conformal Off-Policy Prediction in Contextual Bandits
Taufiq, Muhammad Faaiz
Ton, Jean-Francois
Cornish, Rob
Teh, Yee Whye
Doucet, Arnaud
ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35 (NEURIPS 2022), 2022
[7] Local Metric Learning for Off-Policy Evaluation in Contextual Bandits with Continuous Actions
Lee, Haanvid
Lee, Jongmin
Choi, Yunseon
Jeon, Wonseok
Lee, Byung-Jun
Noh, Yung-Kyun
Kim, Kee-Eung
ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35 (NEURIPS 2022), 2022
[8] Off-Policy Learning in Contextual Bandits for Remote Electrical Tilt Optimization
Vannella, Filippo
Jeong, Jaeseong
Proutiere, Alexandre
IEEE TRANSACTIONS ON VEHICULAR TECHNOLOGY, 2023, 72 (01) : 546 - 556
[9] Off-policy Bandits with Deficient Support
Sachdeva, Noveen
Su, Yi
Joachims, Thorsten
KDD '20: PROCEEDINGS OF THE 26TH ACM SIGKDD INTERNATIONAL CONFERENCE ON KNOWLEDGE DISCOVERY & DATA MINING, 2020, : 965 - 975
[10] Minimax Off-Policy Evaluation for Multi-Armed Bandits
Ma, Cong
Zhu, Banghua
Jiao, Jiantao
Wainwright, Martin J.
IEEE TRANSACTIONS ON INFORMATION THEORY, 2022, 68 (08) : 5314 - 5339
