Future-Dependent Value-Based Off-Policy Evaluation in POMDPs

Cited by: 0
Authors
Uehara, Masatoshi [1 ,2 ]
Kiyohara, Haruka [2 ,7 ]
Bennett, Andrew [2 ,3 ]
Chernozhukov, Victor [4 ]
Jiang, Nan [5 ]
Kallus, Nathan [2 ]
Shi, Chengchun [6 ]
Sun, Wen [2 ]
Affiliations
[1] Genentech Inc, San Francisco, CA 94080 USA
[2] Cornell Univ, Ithaca, NY 14853 USA
[3] Morgan Stanley, New York, NY USA
[4] MIT, Cambridge, MA 02139 USA
[5] UIUC, Champaign, IL USA
[6] LSE, London, England
[7] Tokyo Inst Technol, Tokyo, Japan
Funding
UK Engineering and Physical Sciences Research Council (EPSRC); US National Science Foundation (NSF);
Keywords
VARIABLES; MODELS; COMPLEXITY;
DOI
Not available
CLC Number
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104; 0812; 0835; 1405;
Abstract
We study off-policy evaluation (OPE) for partially observable MDPs (POMDPs) with general function approximation. Existing methods, such as sequential importance sampling estimators, suffer from the curse of horizon in POMDPs. To circumvent this problem, we develop a novel model-free OPE method by introducing future-dependent value functions, which take future proxies as inputs and play a role analogous to that of classical value functions in fully observable MDPs. We derive a new off-policy Bellman equation for future-dependent value functions as a set of conditional moment equations that use history proxies as instrumental variables, and we propose a minimax learning method that learns future-dependent value functions from this equation. We obtain a PAC result, which implies that our OPE estimator is close to the true policy value under Bellman completeness, as long as futures and histories contain sufficient information about latent states. Our code is available at https://github.com/aiueola/neurips2023-future-dependent-ope.
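To make the estimating equations behind this construction concrete, the sketch below writes out, in notation assumed here for illustration rather than taken verbatim from the paper, the conditional moment (instrumental-variable) form of the off-policy Bellman equation and a minimax objective of the kind the abstract describes: g is a candidate future-dependent value function, F and F' are the current and next future proxies, H is the history proxy serving as the instrumental variable, O and A are the current observation and action, R is the reward, and \mu is the evaluation-to-behavior importance weight.

    % Sketch only: all symbols (g, F, F', H, O, A, R, \mu, \xi, \lambda) are illustrative assumptions.
    % Off-policy Bellman equation as a conditional moment restriction,
    % with the history proxy H acting as an instrumental variable:
    \mathbb{E}\bigl[\, \mu(O, A)\{R + \gamma\, g(F')\} - g(F) \,\big|\, H \bigr] = 0,
    \qquad \mu(O, A) = \frac{\pi^{e}(A \mid O)}{\pi^{b}(A \mid O)}.

    % Minimax learning over a value-function class \mathcal{G} and a class \Xi of
    % test functions of the history proxy, with a stabilizing penalty \lambda:
    \hat{g} \in \operatorname*{arg\,min}_{g \in \mathcal{G}} \; \max_{\xi \in \Xi} \;
    \mathbb{E}_{n}\!\bigl[\bigl(\mu(O, A)\{R + \gamma\, g(F')\} - g(F)\bigr)\, \xi(H)\bigr]
    \;-\; \lambda\, \mathbb{E}_{n}\!\bigl[\xi(H)^{2}\bigr].

Under this sketch, the OPE estimate would be an empirical average of \hat{g} over initial future proxies; the penalty \lambda and the particular norm on \xi are modeling choices made here, not values prescribed by the abstract.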
Pages: 18