Future-Dependent Value-Based Off-Policy Evaluation in POMDPs

Cited by: 0
Authors
Uehara, Masatoshi [1 ,2 ]
Kiyohara, Haruka [2 ,7 ]
Bennett, Andrew [2 ,3 ]
Chernozhukov, Victor [4 ]
Jiang, Nan [5 ]
Kallus, Nathan [2 ]
Shi, Chengchun [6 ]
Sun, Wen [2 ]
Affiliations
[1] Genentech Inc, San Francisco, CA 94080 USA
[2] Cornell Univ, Ithaca, NY 14853 USA
[3] Morgan Stanley, New York, NY USA
[4] MIT, Cambridge, MA 02139 USA
[5] UIUC, Champaign, IL USA
[6] LSE, London, England
[7] Tokyo Inst Technol, Tokyo, Japan
Funding
UK Engineering and Physical Sciences Research Council; US National Science Foundation;
Keywords
VARIABLES; MODELS; COMPLEXITY;
DOI
Not available
Chinese Library Classification
TP18 [Artificial intelligence theory];
Discipline Classification Codes
081104; 0812; 0835; 1405;
Abstract
We study off-policy evaluation (OPE) for partially observable MDPs (POMDPs) with general function approximation. Existing methods such as sequential importance sampling estimators suffer from the curse of horizon in POMDPs. To circumvent this problem, we develop a novel model-free OPE method by introducing future-dependent value functions that take future proxies as inputs and play a role similar to that of classical value functions in fully observable MDPs. We derive a new off-policy Bellman equation for future-dependent value functions, expressed as conditional moment equations that use history proxies as instrumental variables. We further propose a minimax learning method that learns future-dependent value functions from this new Bellman equation. We establish a PAC guarantee showing that our OPE estimator is close to the true policy value under Bellman completeness, provided that futures and histories contain sufficient information about the latent states. Our code is available at https://github.com/aiueola/neurips2023-future-dependent-ope.
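To make the abstract's construction concrete, the following is a minimal sketch of the conditional moment (off-policy Bellman) equation and the minimax objective it describes. The notation here is introduced for illustration and is not taken verbatim from the record: $H$ denotes the history proxy, $F$ and $F'$ the current and next future proxies, $O$, $A$, $R$ the current observation, action, and reward, and $\mu(O,A) = \pi^{e}(A \mid O)/\pi^{b}(A \mid O)$ an observation-based importance weight between the evaluation and behavior policies. A future-dependent value function $g_V$ is any solution of

\[
\mathbb{E}\big[\, \mu(O,A)\,\{ R + \gamma\, g_V(F') \} - g_V(F) \;\big|\; H \,\big] = 0 ,
\]

where the history proxy $H$ plays the role of the instrumental variable. Under these assumptions, a minimax estimator over a value class $\mathcal{G}$ and a test-function (critic) class $\Xi$ takes the form

\[
\hat{g} \in \arg\min_{g \in \mathcal{G}} \max_{\xi \in \Xi}\;
\mathbb{E}_n\!\big[\, \xi(H)\,\big( \mu(O,A)\{ R + \gamma\, g(F') \} - g(F) \big) \big]
\;-\; \lambda\, \mathbb{E}_n\!\big[ \xi(H)^2 \big],
\]

with $\mathbb{E}_n$ the empirical average over offline trajectories and $\lambda \ge 0$ a stabilizing regularizer; the policy value is then estimated by averaging $\hat{g}$ over future proxies drawn at the initial distribution. This is a sketch consistent with the abstract's description, not the paper's exact formulation.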
Pages: 18