Simultaneously Learning Stochastic and Adversarial Episodic MDPs with Known Transition

Cited: 0
Authors
Jin, Tiancheng [1 ]
Luo, Haipeng [1 ]
Affiliations
[1] Univ Southern Calif, Los Angeles, CA 90007 USA
Keywords: (none listed)
DOI: not available
Chinese Library Classification: TP18 [Artificial Intelligence Theory]
Discipline Codes: 081104; 0812; 0835; 1405
Abstract
This work studies the problem of learning episodic Markov Decision Processes with known transition and bandit feedback. We develop the first algorithm with a "best-of-both-worlds" guarantee: it achieves O(log T) regret when the losses are stochastic, and simultaneously enjoys worst-case robustness with Õ(√T) regret even when the losses are adversarial, where T is the number of episodes. More generally, it achieves Õ(√C) regret in an intermediate setting where the losses are corrupted by a total amount of C. Our algorithm is based on the Follow-the-Regularized-Leader method from Zimin and Neu [26], with a novel hybrid regularizer inspired by recent works of Zimmert et al. [27, 29] for the special case of multi-armed bandits. Crucially, our regularizer admits a non-diagonal Hessian with a highly complicated inverse. Analyzing such a regularizer and deriving a particular self-bounding regret guarantee is our key technical contribution and might be of independent interest.
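For intuition on the FTRL template the abstract builds on: in the multi-armed-bandit special case (Zimmert et al.'s Tsallis-INF), each round minimizes the cumulative estimated loss plus a 1/2-Tsallis-entropy regularizer over the simplex. The sketch below is a minimal illustrative implementation of that special case, not the paper's hybrid regularizer for MDPs; all function names and the learning-rate choice η_t = 1/√t are our own assumptions for the demo.

```python
import numpy as np

def tsallis_ftrl_weights(L, eta, iters=80):
    """One FTRL step with the 1/2-Tsallis regularizer
    psi(p) = -(4/eta) * sum_i sqrt(p_i).  The KKT conditions give
    p_i = 4 / (eta * (L_i - lam))**2 for a multiplier lam < min(L);
    we find lam by binary search so that the weights sum to 1."""
    K = len(L)
    lo = L.min() - (2.0 / eta) * np.sqrt(K)  # here sum(p) <= 1
    hi = L.min() - 2.0 / eta                 # here the best arm alone gets p = 1
    for _ in range(iters):
        lam = 0.5 * (lo + hi)
        s = np.sum(4.0 / (eta * (L - lam)) ** 2)
        if s > 1.0:
            hi = lam   # weights too large: push lam away from min(L)
        else:
            lo = lam
    p = 4.0 / (eta * (L - 0.5 * (lo + hi))) ** 2
    return p / p.sum()  # tiny renormalization for numerical safety

# Bandit loop with importance-weighted loss estimates (illustrative setup).
rng = np.random.default_rng(0)
K, T, means = 3, 2000, np.array([0.2, 0.5, 0.5])
L = np.zeros(K)                      # cumulative estimated losses
for t in range(1, T + 1):
    p = tsallis_ftrl_weights(L, eta=1.0 / np.sqrt(t))
    a = rng.choice(K, p=p)           # play one arm, observe only its loss
    loss = rng.binomial(1, means[a])
    L[a] += loss / p[a]              # unbiased importance-weighted estimate
```

Note how the update only ever touches the played arm's entry of `L`; the self-bounding property of this regularizer is what yields O(log T) regret in the stochastic regime while retaining Õ(√T) adversarial robustness.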
Pages: 10
Related Papers
50 records total (showing 10)
  • [1] The best of both worlds: stochastic and adversarial episodic MDPs with unknown transition
    Jin, Tiancheng
    Huang, Longbo
    Luo, Haipeng
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 34 (NEURIPS 2021), 2021,
  • [2] Cooperative Online Learning in Stochastic and Adversarial MDPs
    Lancewicki, Tal
    Rosenberg, Aviv
    Mansour, Yishay
    INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 162, 2022,
  • [3] Simultaneously Learning Stochastic and Adversarial Bandits with General Graph Feedback
    Kong, Fang
    Zhou, Yichi
    Li, Shuai
    INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 162, 2022,
  • [4] Simultaneously Learning Stochastic and Adversarial Bandits under the Position-Based Model
    Chen, Cheng
    Zhao, Canzhe
    Li, Shuai
THIRTY-SIXTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FOURTH CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE / THE TWELFTH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2022, : 6202 - 6210
  • [5] Dynamic Regret of Adversarial MDPs with Unknown Transition and Linear Function Approximation
    Li, Long-Fei
    Zhao, Peng
    Zhou, Zhi-Hua
    THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 12, 2024, : 13572 - 13580
  • [6] Online Learning with Off-Policy Feedback in Adversarial MDPs
    Bacchiocchi, Francesco
    Stradi, Francesco Emanuele
    Papini, Matteo
    Metelli, Alberto Maria
    Gatti, Nicola
    PROCEEDINGS OF THE THIRTY-THIRD INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, IJCAI 2024, 2024, : 3697 - 3705
  • [7] Episodic Reinforcement Learning in Finite MDPs: Minimax Lower Bounds Revisited
    Domingues, Omar Darwiche
    Menard, Pierre
    Kaufmann, Emilie
    Valko, Michal
    ALGORITHMIC LEARNING THEORY, VOL 132, 2021, 132
  • [8] Meta Learning MDPs with Linear Transition Models
    Mueller, Robert
    Pacchiano, Aldo
    INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND STATISTICS, VOL 151, 2022, 151 : 5928 - 5948
  • [9] Improved Algorithm for Adversarial Linear Mixture MDPs with Bandit Feedback and Unknown Transition
    Li, Long-Fei
    Zhao, Peng
    Zhou, Zhi-Hua
    INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND STATISTICS, VOL 238, 2024, 238
  • [10] Beating Stochastic and Adversarial Semi-bandits Optimally and Simultaneously
    Zimmert, Julian
    Luo, Haipeng
    Wei, Chen-Yu
    INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 97, 2019, 97