Off-Policy Evaluation via Adaptive Weighting with Data from Contextual Bandits

Times Cited: 11
Authors
Zhan, Ruohan [1 ]
Hadad, Vitor [1 ]
Hirshberg, David A. [1 ]
Athey, Susan [1 ]
Affiliations
[1] Stanford Univ, Stanford, CA 94305 USA
Keywords
contextual bandits; off-policy evaluation; adaptive weighting; variance reduction;
DOI
10.1145/3447548.3467456
CLC Classification
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104; 0812; 0835; 1405;
Abstract
It has become increasingly common for data to be collected adaptively, for example using contextual bandits. Historical data of this type can be used to evaluate other treatment assignment policies to guide future innovation or experiments. However, policy evaluation is challenging if the target policy differs from the one used to collect data, and popular estimators, including doubly robust (DR) estimators, can be plagued by bias, excessive variance, or both. In particular, when the pattern of treatment assignment in the collected data looks little like the pattern generated by the policy to be evaluated, the importance weights used in DR estimators explode, leading to excessive variance. In this paper, we improve the DR estimator by adaptively weighting observations to control its variance. We show that a t-statistic based on our improved estimator is asymptotically normal under certain conditions, allowing us to form confidence intervals and test hypotheses. Using synthetic data and public benchmarks, we provide empirical evidence for our estimator's improved accuracy and inferential properties relative to existing alternatives.
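The abstract describes stabilizing the doubly robust (DR) estimator by weighting its per-observation scores. As a rough illustration only (not the paper's exact estimator), the sketch below computes standard DR scores and then forms a weighted average of them; all function and variable names (`dr_scores`, `weighted_dr_estimate`, `pi_target`, `e_logged`, `m_hat`, `h`) are hypothetical.

```python
import numpy as np

def dr_scores(pi_target, e_logged, actions, rewards, m_hat):
    """Per-observation doubly robust (DR) scores for the target-policy value.

    pi_target: (T, K) target-policy probabilities pi(w | X_t)
    e_logged:  (T, K) logging-policy assignment probabilities e_t(w | X_t)
    actions:   (T,)   logged actions W_t
    rewards:   (T,)   observed rewards Y_t
    m_hat:     (T, K) outcome-model predictions m_hat(X_t, w)
    """
    idx = np.arange(len(actions))
    # Direct-method term: sum_w pi(w | X_t) * m_hat(X_t, w).
    dm = (pi_target * m_hat).sum(axis=1)
    # Importance-weighted correction on the realized action; this ratio is
    # what explodes when the logging and target policies disagree.
    iw = pi_target[idx, actions] / e_logged[idx, actions]
    return dm + iw * (rewards - m_hat[idx, actions])

def weighted_dr_estimate(scores, h):
    """Weighted average sum_t h_t * score_t / sum_t h_t.

    Uniform h recovers the plain (sample-mean) DR estimator; non-uniform h
    can down-weight observations with unstable importance ratios.
    """
    return np.sum(h * scores) / np.sum(h)
```

With uniform weights this is the ordinary DR estimate; the variance-reduction idea is to choose h_t adaptively so that observations whose assignment probabilities produce large importance ratios contribute less. The specific weighting scheme and the conditions for asymptotic normality are developed in the paper itself.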
Pages: 2125-2135 (11 pages)
Related Papers (50 records)
  • [1] Optimal and Adaptive Off-policy Evaluation in Contextual Bandits
    Wang, Yu-Xiang
    Agarwal, Alekh
    Dudik, Miroslav
    INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 70, 2017
  • [2] Effective Off-Policy Evaluation and Learning in Contextual Combinatorial Bandits
    Shimizu, Tatsuhiro
    Tanaka, Koichi
    Kishimoto, Ren
    Kiyohara, Haruka
    Nomura, Masahiro
    Saito, Yuta
    PROCEEDINGS OF THE EIGHTEENTH ACM CONFERENCE ON RECOMMENDER SYSTEMS, RECSYS 2024, 2024, : 733 - 741
  • [3] Marginal Density Ratio for Off-Policy Evaluation in Contextual Bandits
    Taufiq, Muhammad Faaiz
    Doucet, Arnaud
    Cornish, Rob
    Ton, Jean-Francois
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023,
  • [4] Off-Policy Risk Assessment in Contextual Bandits
    Huang, Audrey
    Liu Leqi
    Lipton, Zachary C.
    Azizzadenesheli, Kamyar
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 34 (NEURIPS 2021), 2021, 34
  • [5] Conformal Off-Policy Prediction in Contextual Bandits
    Taufiq, Muhammad Faaiz
    Ton, Jean-Francois
    Cornish, Rob
    Teh, Yee Whye
    Doucet, Arnaud
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35 (NEURIPS 2022), 2022,
  • [6] Optimal Baseline Corrections for Off-Policy Contextual Bandits
    Gupta, Shashank
    Jeunen, Olivier
    Oosterhuis, Harrie
    de Rijke, Maarten
    PROCEEDINGS OF THE EIGHTEENTH ACM CONFERENCE ON RECOMMENDER SYSTEMS, RECSYS 2024, 2024, : 722 - 732
  • [7] Local Metric Learning for Off-Policy Evaluation in Contextual Bandits with Continuous Actions
    Lee, Haanvid
    Lee, Jongmin
    Choi, Yunseon
    Jeon, Wonseok
    Lee, Byung-Jun
    Noh, Yung-Kyun
    Kim, Kee-Eung
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35 (NEURIPS 2022), 2022,
  • [8] Off-Policy Evaluation via Off-Policy Classification
    Irpan, Alex
    Rao, Kanishka
    Bousmalis, Konstantinos
    Harris, Chris
    Ibarz, Julian
    Levine, Sergey
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 32 (NIPS 2019), 2019, 32
  • [9] Off-Policy Learning in Contextual Bandits for Remote Electrical Tilt Optimization
    Vannella, Filippo
    Jeong, Jaeseong
    Proutiere, Alexandre
    IEEE TRANSACTIONS ON VEHICULAR TECHNOLOGY, 2023, 72 (01) : 546 - 556
  • [10] Off-policy Bandits with Deficient Support
    Sachdeva, Noveen
    Su, Yi
    Joachims, Thorsten
    KDD '20: PROCEEDINGS OF THE 26TH ACM SIGKDD INTERNATIONAL CONFERENCE ON KNOWLEDGE DISCOVERY & DATA MINING, 2020, : 965 - 975