Future-Dependent Value-Based Off-Policy Evaluation in POMDPs

Cited by: 0
Authors
Uehara, Masatoshi [1 ,2 ]
Kiyohara, Haruka [2 ,7 ]
Bennett, Andrew [2 ,3 ]
Chernozhukov, Victor [4 ]
Jiang, Nan [5 ]
Kallus, Nathan [2 ]
Shi, Chengchun [6 ]
Sun, Wen [2 ]
Affiliations
[1] Genentech Inc, San Francisco, CA 94080 USA
[2] Cornell Univ, Ithaca, NY 14853 USA
[3] Morgan Stanley, New York, NY USA
[4] MIT, Cambridge, MA 02139 USA
[5] UIUC, Champaign, IL USA
[6] LSE, London, England
[7] Tokyo Inst Technol, Tokyo, Japan
Funding
UK Engineering and Physical Sciences Research Council; US National Science Foundation;
Keywords
VARIABLES; MODELS; COMPLEXITY;
DOI
Not available
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline classification codes
081104; 0812; 0835; 1405;
Abstract
We study off-policy evaluation (OPE) for partially observable MDPs (POMDPs) with general function approximation. Existing methods such as sequential importance sampling estimators suffer from the curse of horizon in POMDPs. To circumvent this problem, we develop a novel model-free OPE method by introducing future-dependent value functions that take future proxies as inputs and play a role similar to that of classical value functions in fully observable MDPs. We derive a new off-policy Bellman equation for future-dependent value functions as conditional moment equations that use history proxies as instrumental variables. We further propose a minimax learning method to learn future-dependent value functions using the new Bellman equation. We establish a PAC result implying that our OPE estimator is close to the true policy value under Bellman completeness, as long as futures and histories contain sufficient information about latent states. Our code is available at https://github.com/aiueola/neurips2023-future-dependent-ope.
Pages: 18
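The abstract's minimax, conditional-moment formulation can be made concrete with a small sketch. The following Python snippet is a minimal illustration, not the authors' implementation from the linked repository: it assumes linear function classes for both the future-dependent value function, V(f) = theta^T phi(f), and the history-proxy test functions, under which the minimax conditional-moment problem reduces to a GMM-style linear system solvable by least squares. All names, shapes, and the synthetic data are illustrative assumptions.

# Illustrative sketch (assumptions noted above): fit a future-dependent value
# function from the conditional-moment Bellman equation
#   E[ psi(H) * ( rho * (R + gamma * V(F')) - V(F) ) ] = 0,
# where H is a history proxy used as an instrument and F, F' are future proxies.
import numpy as np

def fit_future_dependent_value(phi_f, phi_f_next, psi_h, rewards, rho, gamma=0.99):
    """Solve the linear moment system for theta, where V(f) = theta @ phi(f).

    phi_f      : (n, d_f) future-proxy features at the current step
    phi_f_next : (n, d_f) future-proxy features at the next step
    psi_h      : (n, d_h) history-proxy (instrument) features
    rewards    : (n,)     observed rewards
    rho        : (n,)     per-step importance weights pi_e(a|.) / pi_b(a|.)
    """
    n = rewards.shape[0]
    # Moment matrix M = E[ psi(H) (gamma * rho * phi(F') - phi(F))^T ]
    delta = gamma * rho[:, None] * phi_f_next - phi_f
    M = psi_h.T @ delta / n
    # Moment vector b = E[ rho * R * psi(H) ]
    b = psi_h.T @ (rho * rewards) / n
    # Enforce M @ theta + b = 0 in the least-squares sense.
    theta, *_ = np.linalg.lstsq(M, -b, rcond=None)
    return theta

# Toy usage with random synthetic data, just to show the shapes involved.
rng = np.random.default_rng(0)
n, d_f, d_h = 500, 8, 10
phi_f, phi_f_next = rng.normal(size=(n, d_f)), rng.normal(size=(n, d_f))
psi_h = rng.normal(size=(n, d_h))
rewards, rho = rng.normal(size=n), np.exp(rng.normal(scale=0.1, size=n))
theta = fit_future_dependent_value(phi_f, phi_f_next, psi_h, rewards, rho)
# The policy value would then be estimated by averaging theta @ phi(F_0) over
# future proxies observed at the start of trajectories.

In the general function-approximation setting studied in the paper, the same moment condition is instead optimized adversarially, minimizing over a value-function class while maximizing over a class of history-dependent test functions.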