Learning Adversarial Markov Decision Processes with Delayed Feedback

Cited by: 0
Authors
Lancewicki, Tal [1 ]
Rosenberg, Aviv [1 ]
Mansour, Yishay [1 ,2 ]
Affiliations
[1] Tel Aviv Univ, Tel Aviv, Israel
[2] Google Res, Haifa, Israel
Funding
European Research Council; Israel Science Foundation;
Keywords
DOI
Not available
CLC number
TP18 [Artificial Intelligence Theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Reinforcement learning typically assumes that agents observe feedback for their actions immediately, but in many real-world applications (like recommendation systems) feedback is observed in delay. This paper studies online learning in episodic Markov decision processes (MDPs) with unknown transitions, adversarially changing costs, and unrestricted delayed feedback. That is, the costs and trajectory of episode k are revealed to the learner only at the end of episode k + d(k), where the delays d(k) are neither identical nor bounded, and are chosen by an oblivious adversary. We present novel algorithms based on policy optimization that achieve near-optimal high-probability regret of √(K + D) under full-information feedback, where K is the number of episodes and D = Σ_k d(k) is the total delay. Under bandit feedback, we prove similar √(K + D) regret assuming the costs are stochastic, and (K + D)^(2/3) regret in the general case. We are the first to consider regret minimization in the important setting of MDPs with delayed feedback.
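The delayed-feedback protocol described in the abstract can be illustrated with a small simulation: the feedback of episode k becomes observable only at the end of episode k + d(k), and the total delay is D = Σ_k d(k). This is a toy sketch of the feedback model only, not the paper's algorithm; the buffer structure and the sample delays are illustrative assumptions.

```python
from collections import defaultdict

def run_delayed_feedback(num_episodes, delays):
    """Simulate delayed feedback: episode k's costs and trajectory
    are revealed only at the end of episode k + delays[k]."""
    assert len(delays) == num_episodes
    pending = defaultdict(list)  # arrival episode -> episodes revealed then
    revealed = []                # order in which feedback becomes observable
    for k in range(num_episodes):
        # The learner acts in episode k; its feedback is scheduled for later.
        pending[k + delays[k]].append(k)
        # At the end of episode k, all feedback scheduled for now arrives.
        revealed.extend(pending.pop(k, []))
    # Feedback whose arrival time exceeds the horizon arrives afterwards.
    for arrival in sorted(pending):
        revealed.extend(pending[arrival])
    total_delay = sum(delays)    # D = sum_k d(k)
    return revealed, total_delay

order, D = run_delayed_feedback(5, [2, 0, 3, 0, 1])
# Episodes with d(k) = 0 are revealed immediately; episode 0's feedback
# only arrives at the end of episode 2, so feedback order differs from
# episode order, which is what makes learning in this setting hard.
```

Note that because the adversary may choose unbounded, non-identical delays, the learner can act for many episodes before seeing any feedback, which is why the regret bounds above depend on D and not only on K.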
Pages: 7281-7289
Page count: 9