Future-Dependent Value-Based Off-Policy Evaluation in POMDPs

Cited by: 0
Authors
Uehara, Masatoshi [1 ,2 ]
Kiyohara, Haruka [2 ,7 ]
Bennett, Andrew [2 ,3 ]
Chernozhukov, Victor [4 ]
Jiang, Nan [5 ]
Kallus, Nathan [2 ]
Shi, Chengchun [6 ]
Sun, Wen [2 ]
Affiliations
[1] Genentech Inc, San Francisco, CA 94080 USA
[2] Cornell Univ, Ithaca, NY 14853 USA
[3] Morgan Stanley, New York, NY USA
[4] MIT, Cambridge, MA 02139 USA
[5] UIUC, Champaign, IL USA
[6] LSE, London, England
[7] Tokyo Inst Technol, Tokyo, Japan
Funding
UK Engineering and Physical Sciences Research Council (EPSRC); US National Science Foundation (NSF);
Keywords
VARIABLES; MODELS; COMPLEXITY;
DOI
Not available
CLC Number
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104; 0812; 0835; 1405;
Abstract
We study off-policy evaluation (OPE) for partially observable MDPs (POMDPs) with general function approximation. Existing methods, such as sequential importance sampling estimators, suffer from the curse of horizon in POMDPs. To circumvent this problem, we develop a novel model-free OPE method by introducing future-dependent value functions, which take future proxies as inputs and play a role analogous to that of classical value functions in fully observable MDPs. We derive a new off-policy Bellman equation for future-dependent value functions as a set of conditional moment equations that use history proxies as instrumental variables, and we propose a minimax learning method that learns future-dependent value functions from this equation. We obtain a PAC result, which implies that our OPE estimator is close to the true policy value under Bellman completeness, as long as futures and histories contain sufficient information about latent states. Our code is available at https://github.com/aiueola/neurips2023-future-dependent-ope.
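To make the estimating equations behind this construction concrete, the sketch below writes out, in notation assumed here for illustration rather than taken verbatim from the paper, the conditional moment (instrumental-variable) form of the off-policy Bellman equation and a minimax objective of the kind the abstract describes: g is a candidate future-dependent value function, F and F' are the current and next future proxies, H is the history proxy serving as the instrumental variable, O and A are the current observation and action, R is the reward, and \mu is the evaluation-to-behavior importance weight.

    % Sketch only: all symbols (g, F, F', H, O, A, R, \mu, \xi, \lambda) are illustrative assumptions.
    % Off-policy Bellman equation as a conditional moment restriction,
    % with the history proxy H acting as an instrumental variable:
    \mathbb{E}\bigl[\, \mu(O, A)\{R + \gamma\, g(F')\} - g(F) \,\big|\, H \bigr] = 0,
    \qquad \mu(O, A) = \frac{\pi^{e}(A \mid O)}{\pi^{b}(A \mid O)}.

    % Minimax learning over a value-function class \mathcal{G} and a class \Xi of
    % test functions of the history proxy, with a stabilizing penalty \lambda:
    \hat{g} \in \operatorname*{arg\,min}_{g \in \mathcal{G}} \; \max_{\xi \in \Xi} \;
    \mathbb{E}_{n}\!\bigl[\bigl(\mu(O, A)\{R + \gamma\, g(F')\} - g(F)\bigr)\, \xi(H)\bigr]
    \;-\; \lambda\, \mathbb{E}_{n}\!\bigl[\xi(H)^{2}\bigr].

Under this sketch, the OPE estimate would be an empirical average of \hat{g} over initial future proxies; the penalty \lambda and the particular norm on \xi are modeling choices made here, not values prescribed by the abstract.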
Pages: 18