Future-Dependent Value-Based Off-Policy Evaluation in POMDPs

Citations: 0

Authors:

Uehara, Masatoshi [1,2]

Kiyohara, Haruka [2,7]

Bennett, Andrew [2,3]

Chernozhukov, Victor [4]

Jiang, Nan [5]

Kallus, Nathan [2]

Shi, Chengchun [6]

Sun, Wen [2]

Affiliations:

[1] Genentech Inc, San Francisco, CA 94080 USA

[2] Cornell Univ, Ithaca, NY 14853 USA

[3] Morgan Stanley, New York, NY USA

[4] MIT, Cambridge, MA 02139 USA

[5] UIUC, Champaign, IL USA

[6] LSE, London, England

[7] Tokyo Inst Technol, Tokyo, Japan

Source:

ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023) | 2023

Funding:

UK Engineering and Physical Sciences Research Council (EPSRC); US National Science Foundation (NSF);

Keywords:

VARIABLES; MODELS; COMPLEXITY;

DOI:

Not available

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

We study off-policy evaluation (OPE) for partially observable MDPs (POMDPs) with general function approximation. Existing methods such as sequential importance sampling estimators suffer from the curse of horizon in POMDPs. To circumvent this problem, we develop a novel model-free OPE method by introducing future-dependent value functions that take future proxies as inputs and perform a similar role to that of classical value functions in fully-observable MDPs. We derive a new off-policy Bellman equation for future-dependent value functions as conditional moment equations that use history proxies as instrumental variables. We further propose a minimax learning method to learn future-dependent value functions using the new Bellman equation. We obtain the PAC result, which implies our OPE estimator is close to the true policy value under Bellman completeness, as long as futures and histories contain sufficient information about latent states. Our code is available at https://github.com/aiueola/neurips2023-future-dependent-ope.


Pages: 18
