RL for Latent MDPs: Regret Guarantees and a Lower Bound

Cited by: 0
Authors
Kwon, Jeongyeol [1 ]
Efroni, Yonathan [2 ]
Caramanis, Constantine [1 ]
Mannor, Shie [3 ,4 ]
Affiliations
[1] Univ Texas Austin, Austin, TX 78712 USA
[2] Microsoft Res, New York, NY USA
[3] Technion, Haifa, Israel
[4] NVIDIA, Santa Clara, CA USA
DOI: not available
Chinese Library Classification: TP18 [Artificial Intelligence Theory]
Discipline codes: 081104; 0812; 0835; 1405
Abstract
In this work, we consider the regret minimization problem for reinforcement learning in latent Markov Decision Processes (LMDPs). In an LMDP, an MDP is drawn at random from a set of M possible MDPs at the beginning of the interaction, but the identity of the chosen MDP is not revealed to the agent. We first show that a general instance of LMDPs requires at least Omega((SA)^M) episodes to even approximate the optimal policy. We then consider sufficient assumptions under which learning good policies requires only a polynomial number of episodes. We show that the key link is a notion of separation between the MDP system dynamics. With sufficient separation, we provide an efficient algorithm with a local guarantee, i.e., a sublinear regret guarantee given a good initialization. Finally, under standard statistical sufficiency assumptions common in the Predictive State Representation (PSR) literature (e.g., [6]) and a reachability assumption, we show that the need for initialization can be removed.
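The LMDP interaction protocol described above can be illustrated with a minimal simulation: one of M component MDPs is drawn at the start of an episode, and the agent acts without ever observing which one was chosen. This is a hypothetical sketch of the setting only, not the authors' algorithm; the function name `run_lmdp_episode` and the tabular (weight, (transition, reward)) encoding are assumptions made for illustration.

```python
import random

def run_lmdp_episode(mdps, policy, horizon, seed=0):
    """Simulate one episode of a latent MDP (LMDP).

    `mdps` is a list of (weight, (transition, reward)) tuples describing
    the M component MDPs. A component is drawn according to the weights
    at the start of the episode; its identity stays hidden from the agent,
    which only observes states and rewards.
    """
    rng = random.Random(seed)
    weights = [w for w, _ in mdps]
    m = rng.choices(range(len(mdps)), weights=weights)[0]  # latent, never shown to the agent
    transition, reward = mdps[m][1]
    state, total_reward = 0, 0.0
    for t in range(horizon):
        action = policy(state, t)              # policy sees only the observed state and time
        total_reward += reward[state][action]
        next_dist = transition[state][action]  # row of the latent MDP's transition kernel
        state = rng.choices(range(len(next_dist)), weights=next_dist)[0]
    return total_reward

# Two 2-state, 2-action MDPs that differ only in which action is rewarded;
# a policy that is good for one component is useless for the other.
trans = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]]
mdps = [
    (1.0, (trans, [[1.0, 0.0], [1.0, 0.0]])),  # action 0 pays in component 0
    (0.0, (trans, [[0.0, 1.0], [0.0, 1.0]])),  # action 1 pays in component 1
]
print(run_lmdp_episode(mdps, lambda s, t: 0, horizon=5))  # 5.0
```

The example makes the hardness source concrete: since the reward-maximizing action depends on the hidden component, the agent must implicitly infer the latent MDP from trajectories, which is what drives the Omega((SA)^M) lower bound in the general case.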
Pages: 12