Learning in Constrained Markov Decision Processes

Cited by: 6
Authors
Singh, Rahul [1 ]
Gupta, Abhishek [2 ]
Shroff, Ness B. [2 ]
Affiliations
[1] Indian Inst Sci, Dept ECE, Bengaluru 560012, India
[2] Ohio State Univ, Dept ECE, Columbus, OH 43210 USA
Source
IEEE Transactions on Control of Network Systems
Keywords
Costs; Markov processes; Heuristic algorithms; Throughput; Power demand; Network systems; Control systems; Machine learning; Markov decision processes; reinforcement learning; QUEUING-NETWORKS; FLOW-CONTROL;
DOI
10.1109/TCNS.2022.3203361
CLC Classification
TP [Automation and computer technology]
Subject Classification Code
0812
Abstract
We consider reinforcement learning (RL) in Markov decision processes in which an agent repeatedly interacts with an environment that is modeled by a controlled Markov process. At each time step t, it earns a reward and also incurs a cost vector consisting of M costs. We design model-based RL algorithms that maximize the cumulative reward earned over a time horizon of T time steps while simultaneously ensuring that the average values of the M cost expenditures are bounded by agent-specified thresholds $c^{ub}_i$, $i = 1, 2, \ldots, M$. The consideration of the cumulative cost expenditures departs from the existing literature, in that the agent now additionally needs to balance the cost expenses in an online manner while simultaneously performing the exploration-exploitation tradeoff that is typically encountered in RL tasks. This is challenging since the dual objectives of exploration and exploitation necessarily require the agent to expend resources. In order to measure the performance of an RL algorithm that satisfies the average cost constraints, we define an $(M+1)$-dimensional regret vector that is composed of its reward regret and M cost regrets. The reward regret measures the suboptimality in the cumulative reward, while the ith component of the cost regret vector is the difference between its ith cumulative cost expense and the expected cost expenditure $T c^{ub}_i$. We prove that the expected value of the regret vector is upper-bounded as $\tilde{O}(T^{2/3})$, where T is the time horizon, and $\tilde{O}(\cdot)$ hides factors that are logarithmic in T. We further show how to reduce the regret of a desired subset of the M costs, at the expense of increasing the regrets of rewards and the remaining costs. To the best of our knowledge, ours is the only work that considers nonepisodic RL under average cost constraints and derives algorithms that can tune the regret vector according to the agent's requirements on its cost regrets.
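As a minimal illustration (not taken from the paper), the following Python sketch tallies the $(M+1)$-dimensional regret vector defined in the abstract from a single run of T steps. The optimal constrained average reward rho_star is assumed to be known here purely for bookkeeping; an actual learner does not know it, and all names below are hypothetical.

```python
import numpy as np

def regret_vector(rewards, costs, c_ub, rho_star):
    """Tally the (M+1)-dimensional regret vector described in the abstract.

    rewards  : shape (T,)   reward earned at each time step
    costs    : shape (T, M) the M cost expenditures at each time step
    c_ub     : shape (M,)   agent-specified average-cost thresholds c_i^{ub}
    rho_star : optimal constrained average reward (assumed known here only
               for illustration; an RL agent does not have access to it)
    """
    T = len(rewards)
    # Reward regret: shortfall of the cumulative reward relative to T * rho_star.
    reward_regret = T * rho_star - np.sum(rewards)
    # i-th cost regret: cumulative cost expense minus the budget T * c_i^{ub}.
    cost_regrets = np.sum(costs, axis=0) - T * np.asarray(c_ub)
    return np.concatenate(([reward_regret], cost_regrets))

# Toy usage with synthetic data: T = 1000 steps, M = 2 cost components.
rng = np.random.default_rng(0)
rewards = rng.uniform(0.0, 1.0, size=1000)
costs = rng.uniform(0.0, 1.0, size=(1000, 2))
print(regret_vector(rewards, costs, c_ub=[0.6, 0.4], rho_star=0.7))
```

The paper's guarantee is that, under its model-based algorithms, the expected value of this vector grows as $\tilde{O}(T^{2/3})$, and that the regret of a chosen subset of the costs can be reduced at the expense of the reward regret and the remaining cost regrets.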
Pages: 441-453 (13 pages)