Simultaneously Learning Stochastic and Adversarial Episodic MDPs with Known Transition

Cited: 0
Authors
Jin, Tiancheng [1 ]
Luo, Haipeng [1 ]
Affiliations
[1] Univ Southern Calif, Los Angeles, CA 90007 USA
Keywords: (none listed)
DOI: not available
Chinese Library Classification: TP18 [Artificial Intelligence Theory]
Discipline Codes: 081104; 0812; 0835; 1405
Abstract
This work studies the problem of learning episodic Markov Decision Processes with known transition and bandit feedback. We develop the first algorithm with a "best-of-both-worlds" guarantee: it achieves O(log T) regret when the losses are stochastic, and simultaneously enjoys worst-case robustness with Õ(√T) regret even when the losses are adversarial, where T is the number of episodes. More generally, it achieves Õ(√C) regret in an intermediate setting where the losses are corrupted by a total amount of C. Our algorithm is based on the Follow-the-Regularized-Leader method from Zimin and Neu [26], with a novel hybrid regularizer inspired by recent works of Zimmert et al. [27, 29] for the special case of multi-armed bandits. Crucially, our regularizer admits a non-diagonal Hessian with a highly complicated inverse. Analyzing such a regularizer and deriving a particular self-bounding regret guarantee is our key technical contribution and might be of independent interest.
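For intuition on the FTRL template the abstract builds on: in the multi-armed-bandit special case (Zimmert et al.'s Tsallis-INF), each round minimizes the cumulative estimated loss plus a 1/2-Tsallis-entropy regularizer over the simplex. The sketch below is a minimal illustrative implementation of that special case, not the paper's hybrid regularizer for MDPs; all function names and the learning-rate choice η_t = 1/√t are our own assumptions for the demo.

```python
import numpy as np

def tsallis_ftrl_weights(L, eta, iters=80):
    """One FTRL step with the 1/2-Tsallis regularizer
    psi(p) = -(4/eta) * sum_i sqrt(p_i).  The KKT conditions give
    p_i = 4 / (eta * (L_i - lam))**2 for a multiplier lam < min(L);
    we find lam by binary search so that the weights sum to 1."""
    K = len(L)
    lo = L.min() - (2.0 / eta) * np.sqrt(K)  # here sum(p) <= 1
    hi = L.min() - 2.0 / eta                 # here the best arm alone gets p = 1
    for _ in range(iters):
        lam = 0.5 * (lo + hi)
        s = np.sum(4.0 / (eta * (L - lam)) ** 2)
        if s > 1.0:
            hi = lam   # weights too large: push lam away from min(L)
        else:
            lo = lam
    p = 4.0 / (eta * (L - 0.5 * (lo + hi))) ** 2
    return p / p.sum()  # tiny renormalization for numerical safety

# Bandit loop with importance-weighted loss estimates (illustrative setup).
rng = np.random.default_rng(0)
K, T, means = 3, 2000, np.array([0.2, 0.5, 0.5])
L = np.zeros(K)                      # cumulative estimated losses
for t in range(1, T + 1):
    p = tsallis_ftrl_weights(L, eta=1.0 / np.sqrt(t))
    a = rng.choice(K, p=p)           # play one arm, observe only its loss
    loss = rng.binomial(1, means[a])
    L[a] += loss / p[a]              # unbiased importance-weighted estimate
```

Note how the update only ever touches the played arm's entry of `L`; the self-bounding property of this regularizer is what yields O(log T) regret in the stochastic regime while retaining Õ(√T) adversarial robustness.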
Pages: 10
Related Papers
50 records total (showing 10)
  • [1] The best of both worlds: stochastic and adversarial episodic MDPs with unknown transition
    Jin, Tiancheng
    Huang, Longbo
    Luo, Haipeng
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 34 (NEURIPS 2021), 2021,
  • [2] Cooperative Online Learning in Stochastic and Adversarial MDPs
    Lancewicki, Tal
    Rosenberg, Aviv
    Mansour, Yishay
    INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 162, 2022,
  • [3] Simultaneously Learning Stochastic and Adversarial Bandits with General Graph Feedback
    Kong, Fang
    Zhou, Yichi
    Li, Shuai
    INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 162, 2022,
  • [4] Simultaneously Learning Stochastic and Adversarial Bandits under the Position-Based Model
    Chen, Cheng
    Zhao, Canzhe
    Li, Shuai
THIRTY-SIXTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FOURTH CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE / THE TWELFTH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2022, : 6202 - 6210
  • [5] Dynamic Regret of Adversarial MDPs with Unknown Transition and Linear Function Approximation
    Li, Long-Fei
    Zhao, Peng
    Zhou, Zhi-Hua
    THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 12, 2024, : 13572 - 13580
  • [6] Online Learning with Off-Policy Feedback in Adversarial MDPs
    Bacchiocchi, Francesco
    Stradi, Francesco Emanuele
    Papini, Matteo
    Metelli, Alberto Maria
    Gatti, Nicola
    PROCEEDINGS OF THE THIRTY-THIRD INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, IJCAI 2024, 2024, : 3697 - 3705
  • [7] Episodic Reinforcement Learning in Finite MDPs: Minimax Lower Bounds Revisited
    Domingues, Omar Darwiche
    Menard, Pierre
    Kaufmann, Emilie
    Valko, Michal
    ALGORITHMIC LEARNING THEORY, VOL 132, 2021, 132
  • [8] Meta Learning MDPs with Linear Transition Models
    Mueller, Robert
    Pacchiano, Aldo
    INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND STATISTICS, VOL 151, 2022, 151 : 5928 - 5948
  • [9] Improved Algorithm for Adversarial Linear Mixture MDPs with Bandit Feedback and Unknown Transition
    Li, Long-Fei
    Zhao, Peng
    Zhou, Zhi-Hua
    INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND STATISTICS, VOL 238, 2024, 238
  • [10] Beating Stochastic and Adversarial Semi-bandits Optimally and Simultaneously
    Zimmert, Julian
    Luo, Haipeng
    Wei, Chen-Yu
    INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 97, 2019, 97