Blackwell Online Learning for Markov Decision Processes

Times cited: 2

Authors:

Li, Tao [1]

Peng, Guanze [1]

Zhu, Quanyan [1]

Affiliations:

[1] NYU, Dept Elect & Comp Engn, Brooklyn, NY 11220 USA

Source:

2021 55TH ANNUAL CONFERENCE ON INFORMATION SCIENCES AND SYSTEMS (CISS) | 2021

Keywords:

Blackwell approachability; no-regret learning; reinforcement learning; online optimization; STOCHASTIC APPROXIMATIONS; REGRET

DOI:

10.1109/CISS50987.2021.9400319

Chinese Library Classification (CLC) number:

TP [Automation technology, computer technology]

Subject classification code:

0812

Abstract:

This work provides a novel interpretation of Markov Decision Processes (MDPs) from the online optimization viewpoint. In this online optimization context, the policy of the MDP is viewed as the decision variable, while the corresponding value function is treated as payoff feedback from the environment. Based on this interpretation, we construct a Blackwell game induced by the MDP, which bridges the gap among regret minimization, Blackwell approachability theory, and learning theory for MDPs. Specifically, based on approachability theory, we propose 1) Blackwell value iteration for offline planning and 2) Blackwell Q-learning for online learning in MDPs, both of which are shown to converge to the optimal solution. Our theoretical guarantees are corroborated by numerical experiments.
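The abstract does not give the update rules, so the following is only a rough illustration of the "policy as decision variable, value function as payoff feedback" view, not the paper's Blackwell value iteration. It swaps in regret matching (the standard no-regret rule derived from Blackwell approachability) and runs it independently in each state, using one-step lookahead Q-values as that state's payoff vector. The MDP encoding (`P`, `R`), the function names `regret_matching` and `regret_matching_vi`, and the toy MDP are all hypothetical.

```python
from typing import List, Tuple

# Hypothetical tabular MDP encoding (an assumption of this sketch, not the paper's):
#   P[s][a] is a list of (next_state, probability) pairs
#   R[s][a] is the expected immediate reward
Transitions = List[List[List[Tuple[int, float]]]]
Rewards = List[List[float]]


def regret_matching(regrets: List[float]) -> List[float]:
    """Mix actions in proportion to their positive cumulative regret."""
    positive = [max(r, 0.0) for r in regrets]
    total = sum(positive)
    if total <= 0.0:
        # No action has positive regret: fall back to the uniform policy.
        return [1.0 / len(regrets)] * len(regrets)
    return [p / total for p in positive]


def regret_matching_vi(P: Transitions, R: Rewards, gamma: float, iters: int = 2000):
    """Hypothetical per-state regret-matching variant of value iteration."""
    n_states, n_actions = len(R), len(R[0])
    V = [0.0] * n_states
    cum_regret = [[0.0] * n_actions for _ in range(n_states)]
    policy = [[1.0 / n_actions] * n_actions for _ in range(n_states)]
    for _ in range(iters):
        # Payoff vector for each state: one-step lookahead Q-values under the current V.
        Q = [[R[s][a] + gamma * sum(p * V[nxt] for nxt, p in P[s][a])
              for a in range(n_actions)]
             for s in range(n_states)]
        new_V = [0.0] * n_states
        for s in range(n_states):
            # Decision: the mixed policy chosen from accumulated regrets.
            policy[s] = regret_matching(cum_regret[s])
            # Feedback: expected payoff of that mixed policy.
            u = sum(pi * q for pi, q in zip(policy[s], Q[s]))
            # Regret of each pure action relative to the mixed policy.
            for a in range(n_actions):
                cum_regret[s][a] += Q[s][a] - u
            new_V[s] = u
        V = new_V
    return V, policy


# Toy two-state, two-action MDP with deterministic transitions (made up for illustration).
P = [
    [[(0, 1.0)], [(1, 1.0)]],  # state 0: action 0 stays (reward 1), action 1 moves to state 1
    [[(0, 1.0)], [(1, 1.0)]],  # state 1: action 0 moves to state 0, action 1 stays (reward 2)
]
R = [[1.0, 0.0], [0.0, 2.0]]
V, policy = regret_matching_vi(P, R, gamma=0.9)
print([round(v, 3) for v in V])  # [18.0, 20.0]
print(policy)                    # [[0.0, 1.0], [0.0, 1.0]]
```

In the toy MDP the immediate reward in state 0 favors action 0, yet the regrets eventually shift all weight to action 1, and the values settle at the optimal 18 and 20 for a discount factor of 0.9. Regret matching is used here only because it is the textbook rule that comes out of approachability theory; the paper's actual updates and convergence argument may differ.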


Pages: 6
