RL for Latent MDPs: Regret Guarantees and a Lower Bound

Cited by: 0

Authors:

Kwon, Jeongyeol [1]

Efroni, Yonathan [2]

Caramanis, Constantine [1]

Mannor, Shie [3, 4]

Affiliations:

[1] Univ Texas Austin, Austin, TX 78712 USA

[2] Microsoft Res, New York, NY USA

[3] Technion, Haifa, Israel

[4] NVIDIA, Santa Clara, CA USA

Source:

ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 34 (NEURIPS 2021) | 2021 / Vol. 34

Keywords:

DOI:

Not available

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

In this work, we consider the regret minimization problem for reinforcement learning in latent Markov Decision Processes (LMDPs). In an LMDP, an MDP is randomly drawn from a set of M possible MDPs at the beginning of the interaction, but the identity of the chosen MDP is not revealed to the agent. We first show that a general instance of LMDPs requires at least Ω((SA)^M) episodes to even approximate the optimal policy. Then, we consider sufficient assumptions under which learning good policies requires a polynomial number of episodes. We show that the key link is a notion of separation between the MDP system dynamics. With sufficient separation, we provide an efficient algorithm with a local guarantee, i.e., a sublinear regret guarantee when we are given a good initialization. Finally, if we are given standard statistical sufficiency assumptions common in the Predictive State Representation (PSR) literature (e.g., [6]) and a reachability assumption, we show that the need for initialization can be removed.
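The interaction protocol described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm or code: the class `LatentMDP`, the helper `run_episode`, the mixing weights, the fixed horizon, and the deterministic rewards are all assumptions made for illustration. The sketch only shows that one of M tabular MDPs is drawn when an episode starts and that the agent sees states and rewards but never the index of the drawn MDP.

```python
import random


class LatentMDP:
    """Toy tabular LMDP (hypothetical helper, not from the paper).

    transitions[m][s][a] is a list of next-state probabilities for MDP m.
    rewards[m][s][a] is a deterministic reward (a simplifying assumption).
    weights[m] is the probability of drawing MDP m at the start of an episode.
    """

    def __init__(self, transitions, rewards, weights, horizon, seed=0):
        self.transitions = transitions
        self.rewards = rewards
        self.weights = weights
        self.horizon = horizon
        self.rng = random.Random(seed)
        self._context = None  # index of the drawn MDP, hidden from the agent
        self._state = None
        self._t = 0

    def reset(self, initial_state=0):
        # Draw the latent MDP once per episode; it is never returned.
        indices = range(len(self.weights))
        self._context = self.rng.choices(indices, weights=self.weights)[0]
        self._state = initial_state
        self._t = 0
        return self._state

    def step(self, action):
        m, s = self._context, self._state
        reward = self.rewards[m][s][action]
        probs = self.transitions[m][s][action]
        self._state = self.rng.choices(range(len(probs)), weights=probs)[0]
        self._t += 1
        return self._state, reward, self._t >= self.horizon


def run_episode(env, policy):
    """Run one episode; the policy maps the observed history to an action."""
    history = [env.reset()]
    total = 0.0
    done = False
    while not done:
        action = policy(history)
        state, reward, done = env.step(action)
        history += [action, reward, state]
        total += reward
    return total, history
```

Because the latent index is hidden, a policy in this sketch takes the whole observed history as input, which reflects why LMDP policies are generally history-dependent rather than Markovian in the observed state.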


Pages: 12

50 items in total

[41] Optimism in Face of a Context: Regret Guarantees for Stochastic Contextual MDP
Levy, Orin
Mansour, Yishay
THIRTY-SEVENTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 37 NO 7, 2023, : 8510 - 8517
[42] Regret Analysis for RL using Renewal Bandit Feedback
Bhatt, Sujay
Fang, Guanhua
Li, Ping
Samorodnitsky, Gennady
2022 IEEE INFORMATION THEORY WORKSHOP (ITW), 2022, : 137 - 142
[43] Learning in Structured MDPs with Convex Cost Functions: Improved Regret Bounds for Inventory Management
Agrawal S.
Jia R.
Operations Research, 2022, 70 (03): : 1646 - 1664
[44] Learning in Structured MDPs with Convex Cost Functions: Improved Regret Bounds for Inventory Management
Agrawal, Shipra
Jia, Randy
OPERATIONS RESEARCH, 2022,
[45] Learning in Structured MDPs with Convex Cost Functions: Improved Regret Bounds for Inventory Management
Agrawal, Shipra
Jia, Randy
ACM EC '19: PROCEEDINGS OF THE 2019 ACM CONFERENCE ON ECONOMICS AND COMPUTATION, 2019, : 743 - 744
[46] PDFA Distillation with Error Bound Guarantees
Baumgartner, Robert
Verwer, Sicco
IMPLEMENTATION AND APPLICATION OF AUTOMATA, CIAA 2024, 2024, 15015 : 51 - 65
[47] Discovery and density estimation of latent confounders in Bayesian networks with evidence lower bound
Chobtham, Kiattikun
Constantinou, Anthony C.
INTERNATIONAL CONFERENCE ON PROBABILISTIC GRAPHICAL MODELS, VOL 186, 2022, 186
[48] Liquid Welfare Guarantees for No-Regret Learning in Sequential Budgeted Auctions
Fikioris, Giannis
Tardos, Eva
MATHEMATICS OF OPERATIONS RESEARCH, 2024,
[49] Efficient processing of k-regret minimization queries with theoretical guarantees
Zheng, Jiping
Dong, Qi
Wang, Xiaoyang
Zhang, Ying
Ma, Wei
Ma, Yuan
INFORMATION SCIENCES, 2022, 586 : 99 - 118
[50] Scalable Representation Learning in Linear Contextual Bandits with Constant Regret Guarantees
Tirinzoni, Andrea
Papini, Matteo
Touati, Ahmed
Lazaric, Alessandro
Pirotta, Matteo
ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35 (NEURIPS 2022), 2022,
