Learning Adversarial Markov Decision Processes with Delayed Feedback

Cited by: 0
Authors
Lancewicki, Tal [1 ]
Rosenberg, Aviv [1 ]
Mansour, Yishay [1 ,2 ]
Affiliations
[1] Tel Aviv Univ, Tel Aviv, Israel
[2] Google Res, Haifa, Israel
Funding
Israel Science Foundation; European Research Council
Keywords
DOI
None available
CLC Classification
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Reinforcement learning typically assumes that agents observe feedback for their actions immediately, but in many real-world applications (like recommendation systems) feedback is observed in delay. This paper studies online learning in episodic Markov decision processes (MDPs) with unknown transitions, adversarially changing costs, and unrestricted delayed feedback. That is, the costs and trajectory of episode k are revealed to the learner only at the end of episode k + d(k), where the delays d(k) are neither identical nor bounded, and are chosen by an oblivious adversary. We present novel algorithms based on policy optimization that achieve near-optimal high-probability regret of √(K + D) under full-information feedback, where K is the number of episodes and D = Σ_k d(k) is the total delay. Under bandit feedback, we prove similar √(K + D) regret assuming the costs are stochastic, and (K + D)^(2/3) regret in the general case. We are the first to consider regret minimization in the important setting of MDPs with delayed feedback.
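The delayed-feedback protocol described in the abstract, where the feedback of episode k becomes available only at the end of episode k + d(k), can be sketched as a toy simulation. This is purely an illustration of the feedback model and the total-delay quantity D, not the paper's algorithm; the function and variable names are hypothetical:

```python
from collections import defaultdict

def run_delayed_feedback(num_episodes, delays):
    """Simulate the delayed-feedback protocol: the feedback (costs and
    trajectory) of episode k is revealed only at the end of episode k + d(k)."""
    # pending[t] holds the episodes whose feedback arrives at the end of episode t
    pending = defaultdict(list)
    observed = []  # order in which episode feedback becomes available
    for k in range(num_episodes):
        pending[k + delays[k]].append(k)
        # at the end of episode k, reveal everything scheduled for now
        observed.extend(pending.pop(k, []))
    # feedback scheduled past the horizon is revealed after the last episode
    for t in sorted(pending):
        observed.extend(pending[t])
    total_delay = sum(delays)  # D = sum_k d(k), the quantity in the regret bounds
    return observed, total_delay

# With adversarial (non-identical, unbounded) delays, feedback arrives out of order:
order, D = run_delayed_feedback(5, [2, 0, 3, 0, 1])
```

Note that with non-zero delays the learner must select policies for several episodes before seeing any of their outcomes, which is what separates this setting from standard online MDP learning.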
Pages: 7281-7289
Page count: 9
Related Papers
50 records in total
  • [1] Learning Adversarial Markov Decision Processes with Bandit Feedback and Unknown Transition
    Jin, Chi
    Jin, Tiancheng
    Luo, Haipeng
    Sra, Suvrit
    Yu, Tiancheng
    [J]. INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 119, 2020, 119
  • [2] Follow-the-Perturbed-Leader for Adversarial Markov Decision Processes with Bandit Feedback
    Dai, Yan
    Luo, Haipeng
    Chen, Liyu
    [J]. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35, NEURIPS 2022, 2022
  • [3] Learning Adversarial Low-rank Markov Decision Processes with Unknown Transition and Full-information Feedback
    Zhao, Canzhe
    Yang, Ruofeng
    Wang, Baoxiang
    Zhang, Xuezhou
    [J]. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023
  • [4] A Matrosov Theorem for Adversarial Markov Decision Processes
    Teel, Andrew R.
    [J]. IEEE TRANSACTIONS ON AUTOMATIC CONTROL, 2013, 58 (08) : 2142 - 2148
  • [5] Online Convex Optimization in Adversarial Markov Decision Processes
    Rosenberg, Aviv
    Mansour, Yishay
    [J]. INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 97, 2019, 97
  • [6] Learning to Collaborate in Markov Decision Processes
    Radanovic, Goran
    Devidze, Rati
    Parkes, David C.
    Singla, Adish
    [J]. INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 97, 2019, 97
  • [7] Learning in Constrained Markov Decision Processes
    Singh, Rahul
    Gupta, Abhishek
    Shroff, Ness B.
    [J]. IEEE TRANSACTIONS ON CONTROL OF NETWORK SYSTEMS, 2023, 10 (01): : 441 - 453
  • [8] Online Markov Decision Processes Under Bandit Feedback
    Neu, Gergely
    Gyoergy, Andras
    Szepesvari, Csaba
    Antos, Andras
    [J]. IEEE TRANSACTIONS ON AUTOMATIC CONTROL, 2014, 59 (03) : 676 - 691
  • [9] Blackwell Online Learning for Markov Decision Processes
    Li, Tao
    Peng, Guanze
    Zhu, Quanyan
    [J]. 2021 55TH ANNUAL CONFERENCE ON INFORMATION SCIENCES AND SYSTEMS (CISS), 2021
  • [10] Online Learning in Kernelized Markov Decision Processes
    Chowdhury, Sayak Ray
    Gopalan, Aditya
    [J]. 22ND INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND STATISTICS, VOL 89, 2019, 89