Optimal and Adaptive Off-policy Evaluation in Contextual Bandits

Cited by: 0
Authors
Wang, Yu-Xiang [1 ]
Agarwal, Alekh [2 ]
Dudik, Miroslav [2 ]
Affiliations
[1] Carnegie Mellon Univ, Pittsburgh, PA 15213 USA
[2] Microsoft Res, New York, NY 10011 USA
Keywords
DOI
Not available
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Subject classification codes
081104; 0812; 0835; 1405
Abstract
We study the off-policy evaluation problem (estimating the value of a target policy using data collected by another policy) under the contextual bandit model. We consider the general (agnostic) setting without access to a consistent model of rewards and establish a minimax lower bound on the mean squared error (MSE). The bound is matched up to constants by the inverse propensity scoring (IPS) and doubly robust (DR) estimators. This highlights the difficulty of the agnostic contextual setting, in contrast with multi-armed bandits and contextual bandits with access to a consistent reward model, where IPS is suboptimal. We then propose the SWITCH estimator, which can use an existing reward model (not necessarily consistent) to achieve a better bias-variance tradeoff than IPS and DR. We prove an upper bound on its MSE and demonstrate its benefits empirically on a diverse collection of data sets, often outperforming prior work by orders of magnitude.
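To make the estimators mentioned in the abstract concrete, the following is a minimal NumPy sketch of IPS, DR, and a simplified SWITCH-style estimator. The function names, the per-sample array layout (logged rewards, target and logging propensities, and precomputed reward-model estimates), and the way SWITCH's model term is collapsed into a single per-sample value are illustrative assumptions rather than the paper's exact formulation; in the paper, the model term averages the reward model over the target policy's high-weight actions for each context, and the threshold is chosen adaptively.

```python
import numpy as np

def ips_estimate(rewards, target_probs, logging_probs):
    # IPS: importance weights rho_i = pi(a_i|x_i) / mu(a_i|x_i),
    # then a weighted mean of the logged rewards.
    weights = target_probs / logging_probs
    return np.mean(weights * rewards)

def dr_estimate(rewards, target_probs, logging_probs,
                model_reward_logged, model_value_target):
    # DR: direct-method baseline plus an importance-weighted
    # correction of the reward model's residual on logged actions.
    weights = target_probs / logging_probs
    return np.mean(model_value_target + weights * (rewards - model_reward_logged))

def switch_estimate(rewards, target_probs, logging_probs,
                    model_value_target, tau):
    # Simplified SWITCH: importance weighting where the weight is small
    # (low variance), the reward model's estimate where the weight
    # exceeds the threshold tau (high variance).
    weights = target_probs / logging_probs
    return np.mean(np.where(weights <= tau, weights * rewards, model_value_target))

if __name__ == "__main__":
    # Synthetic data purely for illustration.
    rng = np.random.default_rng(0)
    n = 10_000
    logging_probs = rng.uniform(0.05, 0.9, n)   # mu(a_i | x_i)
    target_probs = rng.uniform(0.05, 0.9, n)    # pi(a_i | x_i)
    rewards = rng.binomial(1, 0.5, n).astype(float)
    model_reward_logged = np.full(n, 0.5)       # reward model at the logged (x_i, a_i)
    model_value_target = np.full(n, 0.5)        # model estimate of E_{a ~ pi}[r | x_i]
    print("IPS   ", ips_estimate(rewards, target_probs, logging_probs))
    print("DR    ", dr_estimate(rewards, target_probs, logging_probs,
                                model_reward_logged, model_value_target))
    print("SWITCH", switch_estimate(rewards, target_probs, logging_probs,
                                    model_value_target, tau=5.0))
```

In this sketch the threshold tau trades bias from relying on the (possibly inconsistent) reward model against variance from large importance weights; the paper selects it in a data-dependent way, whereas here it is left as a fixed input.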
Pages: 9
Related Papers
50 records in total
  • [1] Optimal Baseline Corrections for Off-Policy Contextual Bandits
    Gupta, Shashank
    Jeunen, Olivier
    Oosterhuis, Harrie
    de Rijke, Maarten
    PROCEEDINGS OF THE EIGHTEENTH ACM CONFERENCE ON RECOMMENDER SYSTEMS, RECSYS 2024, 2024, : 722 - 732
  • [2] Off-Policy Evaluation via Adaptive Weighting with Data from Contextual Bandits
    Zhan, Ruohan
    Hadad, Vitor
    Hirshberg, David A.
    Athey, Susan
    KDD '21: PROCEEDINGS OF THE 27TH ACM SIGKDD CONFERENCE ON KNOWLEDGE DISCOVERY & DATA MINING, 2021, : 2125 - 2135
  • [3] Effective Off-Policy Evaluation and Learning in Contextual Combinatorial Bandits
    Shimizu, Tatsuhiro
    Tanaka, Koichi
    Kishimoto, Ren
    Kiyohara, Haruka
    Nomura, Masahiro
    Saito, Yuta
    PROCEEDINGS OF THE EIGHTEENTH ACM CONFERENCE ON RECOMMENDER SYSTEMS, RECSYS 2024, 2024, : 733 - 741
  • [4] Marginal Density Ratio for Off-Policy Evaluation in Contextual Bandits
    Taufiq, Muhammad Faaiz
    Doucet, Arnaud
    Cornish, Rob
    Ton, Jean-Francois
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023,
  • [5] Off-Policy Risk Assessment in Contextual Bandits
    Huang, Audrey
    Liu Leqi
    Lipton, Zachary C.
    Azizzadenesheli, Kamyar
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 34 (NEURIPS 2021), 2021, 34
  • [6] Conformal Off-Policy Prediction in Contextual Bandits
    Taufiq, Muhammad Faaiz
    Ton, Jean-Francois
    Cornish, Rob
    Teh, Yee Whye
    Doucet, Arnaud
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35 (NEURIPS 2022), 2022,
  • [7] Local Metric Learning for Off-Policy Evaluation in Contextual Bandits with Continuous Actions
    Lee, Haanvid
    Lee, Jongmin
    Choi, Yunseon
    Jeon, Wonseok
    Lee, Byung-Jun
    Noh, Yung-Kyun
    Kim, Kee-Eung
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35 (NEURIPS 2022), 2022,
  • [8] Off-Policy Learning in Contextual Bandits for Remote Electrical Tilt Optimization
    Vannella, Filippo
    Jeong, Jaeseong
    Proutiere, Alexandre
    IEEE TRANSACTIONS ON VEHICULAR TECHNOLOGY, 2023, 72 (01) : 546 - 556
  • [9] Off-policy Bandits with Deficient Support
    Sachdeva, Noveen
    Su, Yi
    Joachims, Thorsten
    KDD '20: PROCEEDINGS OF THE 26TH ACM SIGKDD INTERNATIONAL CONFERENCE ON KNOWLEDGE DISCOVERY & DATA MINING, 2020, : 965 - 975
  • [10] Minimax Off-Policy Evaluation for Multi-Armed Bandits
    Ma, Cong
    Zhu, Banghua
    Jiao, Jiantao
    Wainwright, Martin J.
    IEEE TRANSACTIONS ON INFORMATION THEORY, 2022, 68 (08) : 5314 - 5339