Learning in Constrained Markov Decision Processes

Cited by: 6
Authors
Singh, Rahul [1 ]
Gupta, Abhishek [2 ]
Shroff, Ness B. [2 ]
Affiliations
[1] Indian Inst Sci, Dept ECE, Bengaluru 560012, India
[2] Ohio State Univ, Dept ECE, Columbus, OH 43210 USA
Source
IEEE Transactions on Control of Network Systems
Keywords
Costs; Markov processes; Heuristic algorithms; Throughput; Power demand; Network systems; Control systems; Machine learning; Markov decision processes; Reinforcement learning; Queuing networks; Flow control
DOI
10.1109/TCNS.2022.3203361
Chinese Library Classification
TP [Automation Technology, Computer Technology]
Discipline Classification Code
0812
Abstract
We consider reinforcement learning (RL) in Markov decision processes in which an agent repeatedly interacts with an environment that is modeled by a controlled Markov process. At each time step t, the agent earns a reward and also incurs a cost vector consisting of M costs. We design model-based RL algorithms that maximize the cumulative reward earned over a time horizon of T time steps while simultaneously ensuring that the average values of the M cost expenditures are bounded by agent-specified thresholds $c_i^{ub}$, $i = 1, 2, \ldots, M$. The consideration of the cumulative cost expenditures departs from the existing literature in that the agent now additionally needs to balance the cost expenses in an online manner while simultaneously performing the exploration-exploitation tradeoff that is typically encountered in RL tasks. This is challenging since the dual objectives of exploration and exploitation necessarily require the agent to expend resources. In order to measure the performance of an RL algorithm that satisfies the average cost constraints, we define an (M+1)-dimensional regret vector that is composed of its reward regret and M cost regrets. The reward regret measures the suboptimality in the cumulative reward, while the i-th component of the cost regret vector is the difference between the i-th cumulative cost expense and the allowed cost expenditure $T c_i^{ub}$. We prove that the expected value of the regret vector is upper-bounded as $\tilde{O}(T^{2/3})$, where T is the time horizon and $\tilde{O}(\cdot)$ hides factors that are logarithmic in T. We further show how to reduce the regret of a desired subset of the M costs, at the expense of increasing the regrets of the reward and the remaining costs. To the best of our knowledge, ours is the only work that considers nonepisodic RL under average cost constraints and derives algorithms that can tune the regret vector according to the agent's requirements on its cost regrets.
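As a hedged illustration, the regret vector described in the abstract can be formalized as follows; the symbols $r_t$ (reward at step t), $c_{i,t}$ (i-th cost at step t), and $\rho^{*}$ (optimal constrained average reward) are assumed notation for this sketch, not taken from the record:

$$\mathrm{Reg}_0(T) = T\rho^{*} - \mathbb{E}\!\left[\sum_{t=1}^{T} r_t\right] \qquad \text{(reward regret)}$$

$$\mathrm{Reg}_i(T) = \mathbb{E}\!\left[\sum_{t=1}^{T} c_{i,t}\right] - T\,c_i^{ub}, \quad i = 1, \ldots, M \qquad \text{(cost regrets)}$$

Read componentwise, the stated main result is then $\mathbb{E}[\mathrm{Reg}_i(T)] = \tilde{O}(T^{2/3})$ for each $i = 0, 1, \ldots, M$.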
Pages: 441-453
Page count: 13