Subgaussian and Differentiable Importance Sampling for Off-Policy Evaluation and Learning

Cited by: 0
Authors
Metelli, Alberto Maria [1]
Russo, Alessio [1]
Restelli, Marcello [1]
Affiliations
[1] Politecn Milan, DEIB, Milan, Italy
DOI
Not available
CLC classification
TP18 [Artificial intelligence theory];
Subject classification codes
081104; 0812; 0835; 1405;
Abstract
Importance Sampling (IS) is a widely used building block for a large variety of off-policy estimation and learning algorithms. However, empirical and theoretical studies have progressively shown that vanilla IS yields poor estimates whenever the behavioral and target policies are too dissimilar. In this paper, we analyze the theoretical properties of the IS estimator by deriving a novel anticoncentration bound that formalizes the intuition behind its undesired behavior. We then propose a new class of IS transformations based on the notion of power mean. To the best of our knowledge, the resulting estimator is the first to achieve, under certain conditions, two key properties: (i) it displays a subgaussian concentration rate; (ii) it preserves differentiability in the target distribution. Finally, we provide numerical simulations on both synthetic examples and contextual bandits, comparing against off-policy evaluation and learning baselines.
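The abstract does not give the exact form of the power-mean transformation, so the following is only a minimal sketch, assuming the corrected weight is a weighted power mean (exponent s, mixing coefficient lam) between the vanilla importance weight and the constant 1; for s = -1 and lam > 0 this yields a harmonic-style correction bounded by 1/lam, which remains differentiable in the weight and hence in the target distribution. The function names, the shifted-Gaussian example, and the parameter values are illustrative assumptions, not definitions taken from the paper.

import numpy as np
from scipy.stats import norm

def power_mean_weights(w, lam=0.1, s=-1.0):
    # Weighted power mean (exponent s, mixing coefficient lam) between the
    # vanilla importance weight w = p_target(x) / p_behavior(x) and 1.
    # lam = 0 recovers vanilla IS; lam > 0 with s < 0 keeps the corrected
    # weight bounded by lam**(1/s), which tames the heavy right tail.
    w = np.asarray(w, dtype=float)
    return ((1.0 - lam) * w**s + lam) ** (1.0 / s)

def power_mean_is_estimate(x, f, p_target, p_behavior, lam=0.1, s=-1.0):
    # Off-policy estimate of E_{p_target}[f(X)] from samples x ~ p_behavior,
    # using power-mean-corrected importance weights.
    w = p_target(x) / p_behavior(x)
    return np.mean(power_mean_weights(w, lam, s) * f(x))

# Tiny synthetic check: behavioral N(0, 1), target N(1, 1), f(x) = x,
# so the quantity being estimated is the target mean (true value 1.0).
rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=5000)
est = power_mean_is_estimate(
    x, f=lambda t: t,
    p_target=lambda t: norm.pdf(t, loc=1.0),
    p_behavior=lambda t: norm.pdf(t, loc=0.0),
    lam=0.05, s=-1.0,
)
print(f"power-mean IS estimate: {est:.3f}")

Increasing lam trades a small bias toward the behavioral distribution for much lighter weight tails, which is the usual mechanism behind the concentration improvements the abstract describes.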
Pages: 14
Related papers
50 records in total
  • [1] Conditional Importance Sampling for Off-Policy Learning
    Rowland, Mark
    Harutyunyan, Anna
    van Hasselt, Hado
    Borsa, Diana
    Schaul, Tom
    Munos, Remi
    Dabney, Will
    [J]. INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND STATISTICS, VOL 108, 2020, 108 : 45 - 54
  • [2] Towards Optimal Off-Policy Evaluation for Reinforcement Learning with Marginalized Importance Sampling
    Xie, Tengyang
    Ma, Yifei
    Wang, Yu-Xiang
    [J]. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 32 (NIPS 2019), 2019, 32
  • [3] Off-Policy Differentiable Logic Reinforcement Learning
    Zhang, Li
    Li, Xin
    Wang, Mingzhong
    Tian, Andong
    [J]. MACHINE LEARNING AND KNOWLEDGE DISCOVERY IN DATABASES, ECML PKDD 2021: RESEARCH TRACK, PT II, 2021, 12976 : 617 - 632
  • [4] Weighted importance sampling for off-policy learning with linear function approximation
    Mahmood, A. Rupam
    van Hasselt, Hado
    Sutton, Richard S.
    [J]. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 27 (NIPS 2014), 2014, 27
  • [5] Adaptive importance sampling for value function approximation in off-policy reinforcement learning
    Hachiya, Hirotaka
    Akiyama, Takayuki
    Sugiyama, Masashi
    Peters, Jan
    [J]. NEURAL NETWORKS, 2009, 22 (10) : 1399 - 1410
  • [6] Off-policy learning based on weighted importance sampling with linear computational complexity
    Mahmood, A. Rupam
    Sutton, Richard S.
    [J]. UNCERTAINTY IN ARTIFICIAL INTELLIGENCE, 2015, : 552 - 561
  • [7] Mixed experience sampling for off-policy reinforcement learning
    Yu, Jiayu
    Li, Jingyao
    Lu, Shuai
    Han, Shuai
    [J]. EXPERT SYSTEMS WITH APPLICATIONS, 2024, 251
  • [8] Off-Policy Evaluation via Off-Policy Classification
    Irpan, Alex
    Rao, Kanishka
    Bousmalis, Konstantinos
    Harris, Chris
    Ibarz, Julian
    Levine, Sergey
    [J]. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 32 (NIPS 2019), 2019, 32
  • [9] A perspective on off-policy evaluation in reinforcement learning
    Li, Lihong
    [J]. FRONTIERS OF COMPUTER SCIENCE, 2019, 13 (05) : 911 - 912