RL for Latent MDPs: Regret Guarantees and a Lower Bound

Cited by: 0
Authors
Kwon, Jeongyeol [1 ]
Efroni, Yonathan [2 ]
Caramanis, Constantine [1 ]
Mannor, Shie [3 ,4 ]
Affiliations
[1] Univ Texas Austin, Austin, TX 78712 USA
[2] Microsoft Res, New York, NY USA
[3] Technion, Haifa, Israel
[4] NVIDIA, Santa Clara, CA USA
DOI: not available
Chinese Library Classification: TP18 [Artificial Intelligence Theory]
Discipline codes: 081104; 0812; 0835; 1405
Abstract
In this work, we consider the regret minimization problem for reinforcement learning in latent Markov Decision Processes (LMDPs). In an LMDP, an MDP is drawn at random from a set of M possible MDPs at the beginning of the interaction, but the identity of the chosen MDP is not revealed to the agent. We first show that a general instance of LMDPs requires at least Omega((SA)^M) episodes to even approximate the optimal policy. We then consider sufficient assumptions under which learning good policies requires only a polynomial number of episodes. We show that the key link is a notion of separation between the MDP system dynamics. With sufficient separation, we provide an efficient algorithm with a local guarantee, i.e., a sublinear regret guarantee given a good initialization. Finally, under standard statistical sufficiency assumptions common in the Predictive State Representation (PSR) literature (e.g., [6]) and a reachability assumption, we show that the need for initialization can be removed.
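The LMDP interaction protocol described above can be illustrated with a minimal simulation: one of M component MDPs is drawn at the start of an episode, and the agent acts without ever observing which one was chosen. This is a hypothetical sketch of the setting only, not the authors' algorithm; the function name `run_lmdp_episode` and the tabular (weight, (transition, reward)) encoding are assumptions made for illustration.

```python
import random

def run_lmdp_episode(mdps, policy, horizon, seed=0):
    """Simulate one episode of a latent MDP (LMDP).

    `mdps` is a list of (weight, (transition, reward)) tuples describing
    the M component MDPs. A component is drawn according to the weights
    at the start of the episode; its identity stays hidden from the agent,
    which only observes states and rewards.
    """
    rng = random.Random(seed)
    weights = [w for w, _ in mdps]
    m = rng.choices(range(len(mdps)), weights=weights)[0]  # latent, never shown to the agent
    transition, reward = mdps[m][1]
    state, total_reward = 0, 0.0
    for t in range(horizon):
        action = policy(state, t)              # policy sees only the observed state and time
        total_reward += reward[state][action]
        next_dist = transition[state][action]  # row of the latent MDP's transition kernel
        state = rng.choices(range(len(next_dist)), weights=next_dist)[0]
    return total_reward

# Two 2-state, 2-action MDPs that differ only in which action is rewarded;
# a policy that is good for one component is useless for the other.
trans = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]]
mdps = [
    (1.0, (trans, [[1.0, 0.0], [1.0, 0.0]])),  # action 0 pays in component 0
    (0.0, (trans, [[0.0, 1.0], [0.0, 1.0]])),  # action 1 pays in component 1
]
print(run_lmdp_episode(mdps, lambda s, t: 0, horizon=5))  # 5.0
```

The example makes the hardness source concrete: since the reward-maximizing action depends on the hidden component, the agent must implicitly infer the latent MDP from trajectories, which is what drives the Omega((SA)^M) lower bound in the general case.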
Pages: 12