Off-Policy Evaluation via Adaptive Weighting with Data from Contextual Bandits

Cited by: 11

Authors:

Zhan, Ruohan [1]

Hadad, Vitor [1]

Hirshberg, David A. [1]

Athey, Susan [1]

Affiliations:

[1] Stanford Univ, Stanford, CA 94305 USA

Source:

KDD '21: PROCEEDINGS OF THE 27TH ACM SIGKDD CONFERENCE ON KNOWLEDGE DISCOVERY & DATA MINING | 2021

Keywords:

contextual bandits; off-policy evaluation; adaptive weighting; variance reduction

DOI:

10.1145/3447548.3467456

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

It has become increasingly common for data to be collected adaptively, for example using contextual bandits. Historical data of this type can be used to evaluate other treatment assignment policies to guide future innovation or experiments. However, policy evaluation is challenging if the target policy differs from the one used to collect data, and popular estimators, including doubly robust (DR) estimators, can be plagued by bias, excessive variance, or both. In particular, when the pattern of treatment assignment in the collected data looks little like the pattern generated by the policy to be evaluated, the importance weights used in DR estimators explode, leading to excessive variance. In this paper, we improve the DR estimator by adaptively weighting observations to control its variance. We show that a t-statistic based on our improved estimator is asymptotically normal under certain conditions, allowing us to form confidence intervals and test hypotheses. Using synthetic data and public benchmarks, we provide empirical evidence for our estimator's improved accuracy and inferential properties relative to existing alternatives.
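
The idea of reweighting doubly robust scores can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the argument layout, and in particular the weight rule in `stabilizing_weight` (inverse square root of a simple variance proxy) are assumptions made for illustration, and the abstract does not specify the weights the paper actually uses. Here `pi` holds target-policy probabilities, `e` holds the logging probabilities in effect at each step (assumed strictly positive), and `mu_hat` holds outcome-model predictions.

```python
import math


def dr_score(pi_row, e_row, mu_row, action, reward):
    """Doubly robust score for one observation.

    pi_row: target policy probabilities per arm (assumed input)
    e_row:  logging probabilities per arm at this step, all > 0 (assumed)
    mu_row: outcome-model predictions per arm (assumed input)
    """
    direct = sum(p * m for p, m in zip(pi_row, mu_row))
    ratio = pi_row[action] / e_row[action]  # importance weight
    return direct + ratio * (reward - mu_row[action])


def stabilizing_weight(pi_row, e_row):
    """Hypothetical weight: 1 / sqrt(sum_a pi(a)^2 / e(a)).

    Observations whose importance weights could be large get less influence.
    """
    return 1.0 / math.sqrt(sum(p * p / q for p, q in zip(pi_row, e_row)))


def weighted_dr_estimate(pi, e, mu_hat, actions, rewards, weights=None):
    """Weighted average of DR scores; uniform weights give the plain DR mean."""
    scores = [
        dr_score(p, q, m, a, y)
        for p, q, m, a, y in zip(pi, e, mu_hat, actions, rewards)
    ]
    if weights is None:
        weights = [1.0] * len(scores)
    return sum(h * s for h, s in zip(weights, scores)) / sum(weights)


# Toy data: the target policy always plays arm 0, while the logging
# policy drifts away from arm 0 in the second round.
pi = [[1.0, 0.0], [1.0, 0.0]]
e = [[0.5, 0.5], [0.1, 0.9]]
mu_hat = [[0.0, 0.0], [0.0, 0.0]]
actions = [0, 0]
rewards = [1.0, 1.0]

plain = weighted_dr_estimate(pi, e, mu_hat, actions, rewards)
h = [stabilizing_weight(p, q) for p, q in zip(pi, e)]
adaptive = weighted_dr_estimate(pi, e, mu_hat, actions, rewards, h)
print(plain, adaptive)
```

In this toy case the second observation has an importance weight of 10, so the uniformly weighted average is 6.0, while the down-weighted version is about 4.47. The sketch only shows how such weights damp extreme scores; the conditions under which a weighted estimator of this kind supports valid confidence intervals are the subject of the paper itself.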

Pages: 2125 - 2135

Page count: 11

50 records in total

[1] Optimal and Adaptive Off-policy Evaluation in Contextual Bandits
Wang, Yu-Xiang
Agarwal, Alekh
Dudik, Miroslav
INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 70, 2017, 70
[2] Effective Off-Policy Evaluation and Learning in Contextual Combinatorial Bandits
Shimizu, Tatsuhiro
Tanaka, Koichi
Kishimoto, Ren
Kiyohara, Haruka
Nomura, Masahiro
Saito, Yuta
PROCEEDINGS OF THE EIGHTEENTH ACM CONFERENCE ON RECOMMENDER SYSTEMS, RECSYS 2024, 2024: 733 - 741
[3] Marginal Density Ratio for Off-Policy Evaluation in Contextual Bandits
Taufiq, Muhammad Faaiz
Doucet, Arnaud
Cornish, Rob
Ton, Jean-Francois
ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023
[4] Off-Policy Risk Assessment in Contextual Bandits
Huang, Audrey
Liu Leqi
Lipton, Zachary C.
Azizzadenesheli, Kamyar
ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 34 (NEURIPS 2021), 2021, 34
[5] Conformal Off-Policy Prediction in Contextual Bandits
Taufiq, Muhammad Faaiz
Ton, Jean-Francois
Cornish, Rob
Teh, Yee Whye
Doucet, Arnaud
ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35 (NEURIPS 2022), 2022
[6] Optimal Baseline Corrections for Off-Policy Contextual Bandits
Gupta, Shashank
Jeunen, Olivier
Oosterhuis, Harrie
de Rijke, Maarten
PROCEEDINGS OF THE EIGHTEENTH ACM CONFERENCE ON RECOMMENDER SYSTEMS, RECSYS 2024, 2024: 722 - 732
[7] Local Metric Learning for Off-Policy Evaluation in Contextual Bandits with Continuous Actions
Lee, Haanvid
Lee, Jongmin
Choi, Yunseon
Jeon, Wonseok
Lee, Byung-Jun
Noh, Yung-Kyun
Kim, Kee-Eung
ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35 (NEURIPS 2022), 2022
[8] Off-Policy Evaluation via Off-Policy Classification
Irpan, Alex
Rao, Kanishka
Bousmalis, Konstantinos
Harris, Chris
Ibarz, Julian
Levine, Sergey
ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 32 (NIPS 2019), 2019, 32
[9] Off-Policy Learning in Contextual Bandits for Remote Electrical Tilt Optimization
Vannella, Filippo
Jeong, Jaeseong
Proutiere, Alexandre
IEEE TRANSACTIONS ON VEHICULAR TECHNOLOGY, 2023, 72 (01) : 546 - 556
[10] Off-policy Bandits with Deficient Support
Sachdeva, Noveen
Su, Yi
Joachims, Thorsten
KDD '20: PROCEEDINGS OF THE 26TH ACM SIGKDD INTERNATIONAL CONFERENCE ON KNOWLEDGE DISCOVERY & DATA MINING, 2020: 965 - 975