Learning Infinite-Horizon Average-Reward Markov Decision Processes with Constraints

Citations: 0
|
Authors
Chen, Liyu [1 ]
Jain, Rahul [1 ]
Luo, Haipeng [1 ]
Affiliations
[1] Univ Southern Calif, Los Angeles, CA 90007 USA
Keywords
DOI
Not available
CLC number
TP18 [Artificial Intelligence Theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
We study regret minimization for infinite-horizon average-reward Markov Decision Processes (MDPs) under cost constraints. We start by designing a policy optimization algorithm with a carefully designed action-value estimator and bonus term, and show that for ergodic MDPs, our algorithm ensures Õ(√T) regret and constant constraint violation, where T is the total number of time steps. This strictly improves over the algorithm of Singh et al. (2020), whose regret and constraint violation are both Õ(T^{2/3}). Next, we consider the most general class of weakly communicating MDPs. Through a finite-horizon approximation, we develop another algorithm with Õ(T^{2/3}) regret and constraint violation, which can be further improved to Õ(√T) via a simple modification, albeit at the cost of making the algorithm computationally inefficient. As far as we know, these are the first provable algorithms for weakly communicating MDPs with cost constraints.
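The abstract's policy-optimization approach rests on optimistic action-value estimates: an empirical estimate plus an exploration bonus that shrinks as a state-action pair is visited more often. As a hypothetical illustration only (the paper's actual estimator and bonus are more carefully designed), a generic count-based bonus of the form b(s, a) = c · √(log T / N(s, a)) can be sketched as:

```python
import math

def optimistic_q(q_hat, counts, total_steps, c=1.0):
    """Add a generic UCB-style exploration bonus to empirical action-value
    estimates: b(s, a) = c * sqrt(log(T) / max(1, N(s, a))).

    q_hat: dict mapping (state, action) -> empirical value estimate
    counts: dict mapping (state, action) -> visit count N(s, a)
    total_steps: total number of time steps T

    This is an illustrative sketch, not the paper's algorithm; the
    constant c and the exact bonus shape are assumptions.
    """
    bonus = {
        sa: c * math.sqrt(math.log(total_steps) / max(1, n))
        for sa, n in counts.items()
    }
    return {sa: q_hat[sa] + bonus[sa] for sa in q_hat}

# Rarely visited pairs receive a larger bonus, steering the learner
# toward under-explored actions.
q = optimistic_q(
    {("s0", "a0"): 0.5, ("s0", "a1"): 0.5},
    {("s0", "a0"): 100, ("s0", "a1"): 1},
    total_steps=1000,
)
```

Here both actions have the same empirical estimate, but the once-visited pair ("s0", "a1") ends up with the larger optimistic value, which is the mechanism that drives exploration in bonus-based algorithms of this kind.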
Pages: 25