RL for Latent MDPs: Regret Guarantees and a Lower Bound

Cited by: 0
Authors
Kwon, Jeongyeol [1]
Efroni, Yonathan [2]
Caramanis, Constantine [1]
Mannor, Shie [3,4]
Affiliations
[1] Univ Texas Austin, Austin, TX 78712 USA
[2] Microsoft Res, New York, NY USA
[3] Technion, Haifa, Israel
[4] NVIDIA, Santa Clara, CA USA
Keywords
DOI
(none)
CLC Classification
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
In this work, we consider the regret minimization problem for reinforcement learning in latent Markov Decision Processes (LMDPs). In an LMDP, an MDP is randomly drawn from a set of M possible MDPs at the beginning of the interaction, but the identity of the chosen MDP is not revealed to the agent. We first show that a general instance of LMDPs requires at least Ω((SA)^M) episodes to even approximate the optimal policy. Then, we consider sufficient assumptions under which learning good policies requires a polynomial number of episodes. We show that the key link is a notion of separation between the MDP system dynamics. With sufficient separation, we provide an efficient algorithm with a local guarantee, i.e., one achieving sublinear regret when given a good initialization. Finally, given standard statistical sufficiency assumptions common in the Predictive State Representation (PSR) literature (e.g., [6]) and a reachability assumption, we show that the need for initialization can be removed.
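The interaction protocol the abstract describes can be made concrete with a small simulation. The sketch below is illustrative only (it is not the paper's algorithm): the tables P and R, the episode count, and the helper names are all hypothetical, and the only point it demonstrates is that the latent index m is drawn fresh each episode and hidden from the agent, which conditions only on observable states.

```python
import random

# Minimal tabular LMDP episode loop (illustrative sketch).
# Nature draws one of M hidden MDPs at the start of each episode;
# the agent observes states and rewards, never the latent index m.

S, A, M, H = 3, 2, 2, 5  # states, actions, latent MDPs, horizon
rng = random.Random(0)

def random_kernel():
    # A random transition kernel P[s][a] -> distribution over next states.
    P = [[[rng.random() for _ in range(S)] for _ in range(A)] for _ in range(S)]
    for s in range(S):
        for a in range(A):
            z = sum(P[s][a])
            P[s][a] = [p / z for p in P[s][a]]
    return P

# Each latent MDP m has its own dynamics P[m] and rewards R[m][s][a] in [0, 1).
P = [random_kernel() for _ in range(M)]
R = [[[rng.random() for _ in range(A)] for _ in range(S)] for _ in range(M)]

def run_episode(policy):
    m = rng.randrange(M)   # latent MDP index: hidden from the agent
    s, total = 0, 0.0
    for t in range(H):
        a = policy(s, t)   # the policy sees only the observable state and time
        total += R[m][s][a]
        s = rng.choices(range(S), weights=P[m][s][a])[0]
    return total

uniform = lambda s, t: rng.randrange(A)
returns = [run_episode(uniform) for _ in range(100)]
```

Because m is resampled every episode, a regret-minimizing agent must implicitly infer which of the M dynamics it is facing from within-episode observations, which is the source of the Ω((SA)^M) hardness in the general case.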
Pages: 12
Related Papers (50 total)
  • [1] Bilinear Exponential Family of MDPs: Frequentist Regret Bound with Tractable Exploration & Planning
    Ouhamma, Reda
    Basu, Debabrota
    Maillard, Odalric
    THIRTY-SEVENTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 37 NO 8, 2023, : 9336 - 9344
  • [2] A Lower Bound for Regret in Logistic Regression
Shamir, Gil I.
    Szpankowski, Wojciech
    2021 IEEE INTERNATIONAL SYMPOSIUM ON INFORMATION THEORY (ISIT), 2021, : 2507 - 2512
  • [3] Dynamic Regret of Adversarial Linear Mixture MDPs
    Li, Long-Fei
    Zhao, Peng
    Zhou, Zhi-Hua
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023,
  • [4] An Exponential Lower Bound for Linearly-Realizable MDPs with Constant Suboptimality Gap
    Wang, Yuanhao
    Wang, Ruosong
    Kakade, Sham M.
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 34 (NEURIPS 2021), 2021, 34
  • [5] Refined Regret for Adversarial MDPs with Linear Function Approximation
    Dai, Yan
    Luo, Haipeng
    Wei, Chen-Yu
    Zimmert, Julian
    INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 202, 2023, 202
  • [6] Regret Minimization in MDPs with Options without Prior Knowledge
    Fruit, Ronan
    Pirotta, Matteo
    Lazaric, Alessandro
    Brunskill, Emma
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 30 (NIPS 2017), 2017, 30
  • [7] Nash Regret Guarantees for Linear Bandits
    Sawarni, Ayush
    Pal, Soumyabrata
    Barman, Siddharth
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023,
  • [8] Regret Guarantees for Online Deep Control
    Chen, Xinyi
    Minasyan, Edgar
    Lee, Jason D.
    Hazan, Elad
    LEARNING FOR DYNAMICS AND CONTROL CONFERENCE, VOL 211, 2023, 211
  • [9] On the Guarantees of Minimizing Regret in Receding Horizon
    Martin, Andrea
    Furieri, Luca
    Dorfler, Florian
    Lygeros, John
    Ferrari-Trecate, Giancarlo
    IEEE TRANSACTIONS ON AUTOMATIC CONTROL, 2025, 70 (03) : 1547 - 1562
  • [10] Proximal Online Gradient Is Optimum for Dynamic Regret: A General Lower Bound
    Zhao, Yawei
    Qiu, Shuang
    Li, Kuan
    Luo, Lailong
    Yin, Jianping
    Liu, Ji
    IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2022, 33 (12) : 7755 - 7764