Learning in Constrained Markov Decision Processes

Cited by: 6
Authors
Singh, Rahul [1 ]
Gupta, Abhishek [2 ]
Shroff, Ness B. [2 ]
Affiliations
[1] Indian Inst Sci, Dept ECE, Bengaluru 560012, India
[2] Ohio State Univ, Dept ECE, Columbus, OH 43210 USA
Source
IEEE Transactions on Control of Network Systems
Keywords
Costs; Markov processes; Heuristic algorithms; Throughput; Power demand; Network systems; Control systems; Machine learning; Markov decision processes; Reinforcement learning; Queuing networks; Flow control
DOI
10.1109/TCNS.2022.3203361
Chinese Library Classification
TP [Automation Technology, Computer Technology]
Discipline Classification Code
0812
Abstract
We consider reinforcement learning (RL) in Markov decision processes in which an agent repeatedly interacts with an environment that is modeled by a controlled Markov process. At each time step t, the agent earns a reward and also incurs a cost vector consisting of M costs. We design model-based RL algorithms that maximize the cumulative reward earned over a time horizon of T time steps while simultaneously ensuring that the average values of the M cost expenditures are bounded by agent-specified thresholds $c_i^{ub}$, $i = 1, 2, \ldots, M$. The consideration of the cumulative cost expenditures departs from the existing literature in that the agent now additionally needs to balance the cost expenses in an online manner while simultaneously performing the exploration-exploitation tradeoff that is typically encountered in RL tasks. This is challenging since the dual objectives of exploration and exploitation necessarily require the agent to expend resources. In order to measure the performance of an RL algorithm that satisfies the average cost constraints, we define an (M+1)-dimensional regret vector that is composed of its reward regret and M cost regrets. The reward regret measures the suboptimality in the cumulative reward, while the i-th component of the cost regret vector is the difference between the i-th cumulative cost expense and the allowed cost expenditure $T c_i^{ub}$. We prove that the expected value of the regret vector is upper-bounded as $\tilde{O}(T^{2/3})$, where T is the time horizon and $\tilde{O}(\cdot)$ hides factors that are logarithmic in T. We further show how to reduce the regret of a desired subset of the M costs, at the expense of increasing the regrets of the reward and the remaining costs. To the best of our knowledge, ours is the only work that considers nonepisodic RL under average cost constraints and derives algorithms that can tune the regret vector according to the agent's requirements on its cost regrets.
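As a hedged illustration, the regret vector described in the abstract can be formalized as follows; the symbols $r_t$ (reward at step t), $c_{i,t}$ (i-th cost at step t), and $\rho^{*}$ (optimal constrained average reward) are assumed notation for this sketch, not taken from the record:

$$\mathrm{Reg}_0(T) = T\rho^{*} - \mathbb{E}\!\left[\sum_{t=1}^{T} r_t\right] \qquad \text{(reward regret)}$$

$$\mathrm{Reg}_i(T) = \mathbb{E}\!\left[\sum_{t=1}^{T} c_{i,t}\right] - T\,c_i^{ub}, \quad i = 1, \ldots, M \qquad \text{(cost regrets)}$$

Read componentwise, the stated main result is then $\mathbb{E}[\mathrm{Reg}_i(T)] = \tilde{O}(T^{2/3})$ for each $i = 0, 1, \ldots, M$.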
Pages: 441-453
Page count: 13