Learning in Constrained Markov Decision Processes

Cited by: 6
Authors
Singh, Rahul [1 ]
Gupta, Abhishek [2 ]
Shroff, Ness B. [2 ]
Affiliations
[1] Indian Inst Sci, Dept ECE, Bengaluru 560012, India
[2] Ohio State Univ, Dept ECE, Columbus, OH 43210 USA
Source
IEEE Transactions on Control of Network Systems
Keywords
Costs; Markov processes; Heuristic algorithms; Throughput; Power demand; Network systems; Control systems; Machine learning; Markov decision processes; reinforcement learning; QUEUING-NETWORKS; FLOW-CONTROL;
DOI
10.1109/TCNS.2022.3203361
CLC Classification
TP [Automation and computer technology]
Subject Classification Code
0812
Abstract
We consider reinforcement learning (RL) in Markov decision processes in which an agent repeatedly interacts with an environment that is modeled by a controlled Markov process. At each time step t, it earns a reward and also incurs a cost vector consisting of M costs. We design model-based RL algorithms that maximize the cumulative reward earned over a time horizon of T time steps while simultaneously ensuring that the average values of the M cost expenditures are bounded by agent-specified thresholds $c^{ub}_i$, $i = 1, 2, \ldots, M$. The consideration of the cumulative cost expenditures departs from the existing literature, in that the agent now additionally needs to balance the cost expenses in an online manner while simultaneously performing the exploration-exploitation tradeoff that is typically encountered in RL tasks. This is challenging since the dual objectives of exploration and exploitation necessarily require the agent to expend resources. In order to measure the performance of an RL algorithm that satisfies the average cost constraints, we define an $(M+1)$-dimensional regret vector that is composed of its reward regret and M cost regrets. The reward regret measures the suboptimality in the cumulative reward, while the ith component of the cost regret vector is the difference between its ith cumulative cost expense and the expected cost expenditure $T c^{ub}_i$. We prove that the expected value of the regret vector is upper-bounded as $\tilde{O}(T^{2/3})$, where T is the time horizon, and $\tilde{O}(\cdot)$ hides factors that are logarithmic in T. We further show how to reduce the regret of a desired subset of the M costs, at the expense of increasing the regrets of rewards and the remaining costs. To the best of our knowledge, ours is the only work that considers nonepisodic RL under average cost constraints and derives algorithms that can tune the regret vector according to the agent's requirements on its cost regrets.
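As a minimal illustration (not taken from the paper), the following Python sketch tallies the $(M+1)$-dimensional regret vector defined in the abstract from a single run of T steps. The optimal constrained average reward rho_star is assumed to be known here purely for bookkeeping; an actual learner does not know it, and all names below are hypothetical.

```python
import numpy as np

def regret_vector(rewards, costs, c_ub, rho_star):
    """Tally the (M+1)-dimensional regret vector described in the abstract.

    rewards  : shape (T,)   reward earned at each time step
    costs    : shape (T, M) the M cost expenditures at each time step
    c_ub     : shape (M,)   agent-specified average-cost thresholds c_i^{ub}
    rho_star : optimal constrained average reward (assumed known here only
               for illustration; an RL agent does not have access to it)
    """
    T = len(rewards)
    # Reward regret: shortfall of the cumulative reward relative to T * rho_star.
    reward_regret = T * rho_star - np.sum(rewards)
    # i-th cost regret: cumulative cost expense minus the budget T * c_i^{ub}.
    cost_regrets = np.sum(costs, axis=0) - T * np.asarray(c_ub)
    return np.concatenate(([reward_regret], cost_regrets))

# Toy usage with synthetic data: T = 1000 steps, M = 2 cost components.
rng = np.random.default_rng(0)
rewards = rng.uniform(0.0, 1.0, size=1000)
costs = rng.uniform(0.0, 1.0, size=(1000, 2))
print(regret_vector(rewards, costs, c_ub=[0.6, 0.4], rho_star=0.7))
```

The paper's guarantee is that, under its model-based algorithms, the expected value of this vector grows as $\tilde{O}(T^{2/3})$, and that the regret of a chosen subset of the costs can be reduced at the expense of the reward regret and the remaining cost regrets.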
Pages: 441-453 (13 pages)