Blackwell Online Learning for Markov Decision Processes

Cited by: 2
Authors
Li, Tao [1]
Peng, Guanze [1]
Zhu, Quanyan [1]
Affiliations
[1] NYU, Dept Elect & Comp Engn, Brooklyn, NY 11220 USA
Keywords
Blackwell approachability; no-regret learning; reinforcement learning; online optimization; stochastic approximations; regret
DOI
10.1109/CISS50987.2021.9400319
Chinese Library Classification number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
This work provides a novel interpretation of Markov Decision Processes (MDPs) from the online optimization viewpoint. In this online optimization context, the policy of the MDP is viewed as the decision variable, while the corresponding value function is treated as payoff feedback from the environment. Based on this interpretation, we construct a Blackwell game induced by the MDP, which bridges regret minimization, Blackwell approachability theory, and learning theory for MDPs. Specifically, based on approachability theory, we propose 1) Blackwell value iteration for offline planning and 2) Blackwell Q-learning for online learning in MDPs, both of which are shown to converge to the optimal solution. Our theoretical guarantees are corroborated by numerical experiments.
Pages: 6
Related papers
50 in total
  • [1] Online Learning in Kernelized Markov Decision Processes
    Chowdhury, Sayak Ray
    Gopalan, Aditya
    [J]. 22ND INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND STATISTICS, VOL 89, 2019, 89
  • [2] Online Learning of Safety function for Markov Decision Processes
    Mazumdar, Abhijit
    Wisniewski, Rafal
    Bujorianu, Manuela L.
    [J]. 2023 EUROPEAN CONTROL CONFERENCE, ECC, 2023,
  • [3] Online Learning in Markov Decision Processes with Continuous Actions
    Hong, Yi-Te
    Lu, Chi-Jen
    [J]. ALGORITHMIC LEARNING THEORY, ALT 2015, 2015, 9355 : 302 - 316
  • [4] Blackwell optimality in Markov decision processes with partial observation
    Rosenberg, D
    Solan, E
    Vieille, N
    [J]. ANNALS OF STATISTICS, 2002, 30 (04): : 1178 - 1193
  • [5] Online Markov Decision Processes
    Even-Dar, Eyal
    Kakade, Sham M.
    Mansour, Yishay
    [J]. MATHEMATICS OF OPERATIONS RESEARCH, 2009, 34 (03) : 726 - 736
  • [6] Online Learning in Markov Decision Processes with Changing Cost Sequences
    Dick, Travis
    Gyorgy, Andras
    Szepesvari, Csaba
    [J]. INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 32 (CYCLE 1), 2014, 32
  • [7] Online Learning with Implicit Exploration in Episodic Markov Decision Processes
    Ghasemi, Mahsa
    Hashemi, Abolfazl
    Vikalo, Haris
    Topcu, Ufuk
    [J]. 2021 AMERICAN CONTROL CONFERENCE (ACC), 2021, : 1953 - 1958
  • [8] Blackwell optimality in Markov decision processes with a Borel state space
    Yushkevich, AA
    [J]. PROCEEDINGS OF THE 36TH IEEE CONFERENCE ON DECISION AND CONTROL, VOLS 1-5, 1997, : 2827 - 2830
  • [9] Online Learning in Markov Decision Processes with Arbitrarily Changing Rewards and Transitions
    Yu, Jia Yuan
    Mannor, Shie
    [J]. 2009 INTERNATIONAL CONFERENCE ON GAME THEORY FOR NETWORKS (GAMENETS 2009), 2009, : 314 - 322
  • [10] Online Reinforcement Learning of Optimal Threshold Policies for Markov Decision Processes
    Roy, Arghyadip
    Borkar, Vivek
    Karandikar, Abhay
    Chaporkar, Prasanna
    [J]. IEEE TRANSACTIONS ON AUTOMATIC CONTROL, 2022, 67 (07) : 3722 - 3729