Minimax weight learning for absorbing MDPs

Cited by: 0

Authors:

Li, Fengying [1]

Li, Yuqiang [1]

Wu, Xianyi [1]

Affiliations:

[1] East China Normal Univ, Sch Stat, KLATASDS MOE, Shanghai 200062, Peoples R China

Source:

STATISTICAL PAPERS | 2024 / Vol. 65 / Issue 06

Funding:

National Key Research and Development Program of China;

Keywords:

Absorbing MDP; Off-policy; Minimax weight learning; Policy evaluation; Occupancy measure; MODELS;

DOI:

10.1007/s00362-023-01491-4

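Chinese Library Classification (CLC) number:

O21 [Probability Theory and Mathematical Statistics]; C8 [Statistics];

Discipline classification codes:

020208 ; 070103 ; 0714 ;

Abstract: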

Reinforcement learning policy evaluation problems are often modeled as finite-horizon or discounted/averaged infinite-horizon Markov Decision Processes (MDPs). In this paper, we study undiscounted off-policy evaluation for absorbing MDPs. Given a dataset consisting of i.i.d. episodes under a given truncation level, we propose an algorithm (referred to as MWLA in the text) to directly estimate the expected return via the importance ratio of the state-action occupancy measure. The Mean Square Error (MSE) bound of the MWLA method is provided, and the dependence of the statistical errors on the data size and the truncation level is analyzed. The performance of the algorithm is illustrated by means of computational experiments under an episodic taxi environment.
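
As a rough illustration of the idea in the abstract, the sketch below shows how an importance ratio over state-action pairs could be plugged into a return estimate once it is available. It is not the MWLA algorithm itself: the record does not describe how the ratio is learned, so the weights here are simply given as input. The function name `weighted_return_estimate`, the episode layout, and the toy numbers are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical data layout: one episode is a list of (state, action, reward).
Transition = Tuple[int, int, float]


def weighted_return_estimate(
    episodes: List[List[Transition]],
    weights: Dict[Tuple[int, int], float],
    truncation_level: int,
) -> float:
    """Average over episodes of sum_{t < H} w(s_t, a_t) * r_t.

    `weights` stands in for an already-estimated ratio between the target
    and behavior state-action occupancy measures (assumed given here).
    `truncation_level` (H) caps how many steps of each episode are used.
    """
    if not episodes:
        raise ValueError("need at least one episode")
    total = 0.0
    for episode in episodes:
        # Only the first H transitions of each episode contribute.
        for state, action, reward in episode[:truncation_level]:
            total += weights[(state, action)] * reward
    return total / len(episodes)


# Toy data (made up): two short episodes that end by absorption.
episodes = [
    [(0, 0, 1.0), (1, 1, 2.0)],
    [(0, 1, 0.5)],
]
weights = {(0, 0): 2.0, (1, 1): 0.5, (0, 1): 1.0}

estimate = weighted_return_estimate(episodes, weights, truncation_level=2)
print(estimate)
```

Lowering `truncation_level` drops the later transitions of longer episodes, which is one simple way to see why the abstract treats the truncation level as a source of error alongside the data size.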


Pages: 3545-3582

Number of pages: 38

50 items in total

[1] Nearly Minimax Optimal Reinforcement Learning for Discounted MDPs
He, Jiafan
Zhou, Dongruo
Gu, Quanquan
ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 34 (NEURIPS 2021), 2021, 34
[2] Episodic Reinforcement Learning in Finite MDPs: Minimax Lower Bounds Revisited
Domingues, Omar Darwiche
Menard, Pierre
Kaufmann, Emilie
Valko, Michal
ALGORITHMIC LEARNING THEORY, VOL 132, 2021, 132
[3] Nearly Minimax Optimal Regret for Learning Infinite-horizon Average-reward MDPs with Linear Function Approximation
Wu, Yue
Zhou, Dongruo
Gu, Quanquan
INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND STATISTICS, VOL 151, 2022, 151
[4] ε-MDPs: Learning in varying environments
Szita, I
Takács, B
Lorincz, A
JOURNAL OF MACHINE LEARNING RESEARCH, 2003, 3 (01) : 145 - 173
[5] Learning to Branch with Tree MDPs
Scavuzzo, Lara
Chen, Feng Yang
Chetelat, Didier
Gasse, Maxime
Lodi, Andrea
Yorke-Smith, Neil
Aardal, Karen
ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35 (NEURIPS 2022), 2022,
[6] Reinforcement learning for MDPs with constraints
Geibel, Peter
MACHINE LEARNING: ECML 2006, PROCEEDINGS, 2006, 4212 : 646 - 653
[7] Efficient reinforcement learning in factored MDPs
Kearns, M
Koller, D
IJCAI-99: PROCEEDINGS OF THE SIXTEENTH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOLS 1 & 2, 1999, : 740 - 747
[8] Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning
Sutton, RS
Precup, D
Singh, S
ARTIFICIAL INTELLIGENCE, 1999, 112 (1-2) : 181 - 211
[9] Multitask reinforcement learning on the distribution of MDPs
Tanaka, F
Yamamura, M
2003 IEEE INTERNATIONAL SYMPOSIUM ON COMPUTATIONAL INTELLIGENCE IN ROBOTICS AND AUTOMATION, VOLS I-III, PROCEEDINGS, 2003, : 1108 - 1113
[10] Expedited Learning in MDPs with Side Information
Ornik, Melkior
Fu, Jie
Lauffer, Niklas T.
Perera, W. K.
Alshiekh, Mohammed
Ono, Masahiro
Topcu, Ufuk
2018 IEEE CONFERENCE ON DECISION AND CONTROL (CDC), 2018, : 1941 - 1948
