RL for Latent MDPs: Regret Guarantees and a Lower Bound

Cited by: 0
Authors
Kwon, Jeongyeol [1]
Efroni, Yonathan [2]
Caramanis, Constantine [1]
Mannor, Shie [3,4]
Affiliations
[1] Univ Texas Austin, Austin, TX 78712 USA
[2] Microsoft Res, New York, NY USA
[3] Technion, Haifa, Israel
[4] NVIDIA, Santa Clara, CA USA
Keywords
DOI
(none)
CLC Classification
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
In this work, we consider the regret minimization problem for reinforcement learning in latent Markov Decision Processes (LMDPs). In an LMDP, an MDP is randomly drawn from a set of M possible MDPs at the beginning of the interaction, but the identity of the chosen MDP is not revealed to the agent. We first show that a general instance of LMDPs requires at least Ω((SA)^M) episodes to even approximate the optimal policy. Then, we consider sufficient assumptions under which learning good policies requires a polynomial number of episodes. We show that the key link is a notion of separation between the MDP system dynamics. With sufficient separation, we provide an efficient algorithm with a local guarantee, i.e., one achieving sublinear regret when given a good initialization. Finally, given standard statistical sufficiency assumptions common in the Predictive State Representation (PSR) literature (e.g., [6]) and a reachability assumption, we show that the need for initialization can be removed.
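The interaction protocol the abstract describes can be made concrete with a small simulation. The sketch below is illustrative only (it is not the paper's algorithm): the tables P and R, the episode count, and the helper names are all hypothetical, and the only point it demonstrates is that the latent index m is drawn fresh each episode and hidden from the agent, which conditions only on observable states.

```python
import random

# Minimal tabular LMDP episode loop (illustrative sketch).
# Nature draws one of M hidden MDPs at the start of each episode;
# the agent observes states and rewards, never the latent index m.

S, A, M, H = 3, 2, 2, 5  # states, actions, latent MDPs, horizon
rng = random.Random(0)

def random_kernel():
    # A random transition kernel P[s][a] -> distribution over next states.
    P = [[[rng.random() for _ in range(S)] for _ in range(A)] for _ in range(S)]
    for s in range(S):
        for a in range(A):
            z = sum(P[s][a])
            P[s][a] = [p / z for p in P[s][a]]
    return P

# Each latent MDP m has its own dynamics P[m] and rewards R[m][s][a] in [0, 1).
P = [random_kernel() for _ in range(M)]
R = [[[rng.random() for _ in range(A)] for _ in range(S)] for _ in range(M)]

def run_episode(policy):
    m = rng.randrange(M)   # latent MDP index: hidden from the agent
    s, total = 0, 0.0
    for t in range(H):
        a = policy(s, t)   # the policy sees only the observable state and time
        total += R[m][s][a]
        s = rng.choices(range(S), weights=P[m][s][a])[0]
    return total

uniform = lambda s, t: rng.randrange(A)
returns = [run_episode(uniform) for _ in range(100)]
```

Because m is resampled every episode, a regret-minimizing agent must implicitly infer which of the M dynamics it is facing from within-episode observations, which is the source of the Ω((SA)^M) hardness in the general case.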
Pages: 12
Related Papers (50 total)
  • [1] Bilinear Exponential Family of MDPs: Frequentist Regret Bound with Tractable Exploration & Planning
    Ouhamma, Reda
    Basu, Debabrota
    Maillard, Odalric
    THIRTY-SEVENTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 37 NO 8, 2023, : 9336 - 9344
  • [2] A Lower Bound for Regret in Logistic Regression
Shamir, Gil I.
    Szpankowski, Wojciech
    2021 IEEE INTERNATIONAL SYMPOSIUM ON INFORMATION THEORY (ISIT), 2021, : 2507 - 2512
  • [3] Dynamic Regret of Adversarial Linear Mixture MDPs
    Li, Long-Fei
    Zhao, Peng
    Zhou, Zhi-Hua
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023,
  • [4] An Exponential Lower Bound for Linearly-Realizable MDPs with Constant Suboptimality Gap
    Wang, Yuanhao
    Wang, Ruosong
    Kakade, Sham M.
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 34 (NEURIPS 2021), 2021, 34
  • [5] Refined Regret for Adversarial MDPs with Linear Function Approximation
    Dai, Yan
    Luo, Haipeng
    Wei, Chen-Yu
    Zimmert, Julian
    INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 202, 2023, 202
  • [6] Regret Minimization in MDPs with Options without Prior Knowledge
    Fruit, Ronan
    Pirotta, Matteo
    Lazaric, Alessandro
    Brunskill, Emma
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 30 (NIPS 2017), 2017, 30
  • [7] Nash Regret Guarantees for Linear Bandits
    Sawarni, Ayush
    Pal, Soumyabrata
    Barman, Siddharth
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023,
  • [8] Regret Guarantees for Online Deep Control
    Chen, Xinyi
    Minasyan, Edgar
    Lee, Jason D.
    Hazan, Elad
    LEARNING FOR DYNAMICS AND CONTROL CONFERENCE, VOL 211, 2023, 211
  • [9] On the Guarantees of Minimizing Regret in Receding Horizon
    Martin, Andrea
    Furieri, Luca
    Dorfler, Florian
    Lygeros, John
    Ferrari-Trecate, Giancarlo
    IEEE TRANSACTIONS ON AUTOMATIC CONTROL, 2025, 70 (03) : 1547 - 1562
  • [10] Proximal Online Gradient Is Optimum for Dynamic Regret: A General Lower Bound
    Zhao, Yawei
    Qiu, Shuang
    Li, Kuan
    Luo, Lailong
    Yin, Jianping
    Liu, Ji
    IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2022, 33 (12) : 7755 - 7764