Learning Adversarial Markov Decision Processes with Delayed Feedback

Cited by: 0
Authors
Lancewicki, Tal [1 ]
Rosenberg, Aviv [1 ]
Mansour, Yishay [1 ,2 ]
Affiliations
[1] Tel Aviv Univ, Tel Aviv, Israel
[2] Google Res, Haifa, Israel
Funding
European Research Council; Israel Science Foundation;
Keywords
DOI
Not available
CLC number
TP18 [Artificial Intelligence Theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Reinforcement learning typically assumes that agents observe feedback for their actions immediately, but in many real-world applications (like recommendation systems) feedback is observed in delay. This paper studies online learning in episodic Markov decision processes (MDPs) with unknown transitions, adversarially changing costs, and unrestricted delayed feedback. That is, the costs and trajectory of episode k are revealed to the learner only at the end of episode k + d(k), where the delays d(k) are neither identical nor bounded, and are chosen by an oblivious adversary. We present novel algorithms based on policy optimization that achieve near-optimal high-probability regret of √(K + D) under full-information feedback, where K is the number of episodes and D = Σ_k d(k) is the total delay. Under bandit feedback, we prove similar √(K + D) regret assuming the costs are stochastic, and (K + D)^(2/3) regret in the general case. We are the first to consider regret minimization in the important setting of MDPs with delayed feedback.
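The delayed-feedback protocol described in the abstract can be illustrated with a small simulation: the feedback of episode k becomes observable only at the end of episode k + d(k), and the total delay is D = Σ_k d(k). This is a toy sketch of the feedback model only, not the paper's algorithm; the buffer structure and the sample delays are illustrative assumptions.

```python
from collections import defaultdict

def run_delayed_feedback(num_episodes, delays):
    """Simulate delayed feedback: episode k's costs and trajectory
    are revealed only at the end of episode k + delays[k]."""
    assert len(delays) == num_episodes
    pending = defaultdict(list)  # arrival episode -> episodes revealed then
    revealed = []                # order in which feedback becomes observable
    for k in range(num_episodes):
        # The learner acts in episode k; its feedback is scheduled for later.
        pending[k + delays[k]].append(k)
        # At the end of episode k, all feedback scheduled for now arrives.
        revealed.extend(pending.pop(k, []))
    # Feedback whose arrival time exceeds the horizon arrives afterwards.
    for arrival in sorted(pending):
        revealed.extend(pending[arrival])
    total_delay = sum(delays)    # D = sum_k d(k)
    return revealed, total_delay

order, D = run_delayed_feedback(5, [2, 0, 3, 0, 1])
# Episodes with d(k) = 0 are revealed immediately; episode 0's feedback
# only arrives at the end of episode 2, so feedback order differs from
# episode order, which is what makes learning in this setting hard.
```

Note that because the adversary may choose unbounded, non-identical delays, the learner can act for many episodes before seeing any feedback, which is why the regret bounds above depend on D and not only on K.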
Pages: 7281-7289
Page count: 9