On-Policy Deep Reinforcement Learning for the Average-Reward Criterion

Cited by: 0
Authors
Zhang, Yiming [1 ]
Ross, Keith W. [2 ]
Affiliations
[1] NYU, New York, NY 10003 USA
[2] New York Univ Shanghai, Shanghai, People's Republic of China
Keywords
MARKOV DECISION-PROCESSES; ALGORITHMS; GO;
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial Intelligence Theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
We develop theory and algorithms for average-reward on-policy Reinforcement Learning (RL). We first consider bounding the difference of the long-term average reward for two policies. We show that previous work based on the discounted return (Schulman et al., 2015; Achiam et al., 2017) results in a non-meaningful bound in the average-reward setting. By addressing the average-reward criterion directly, we then derive a novel bound which depends on the average divergence between the two policies and Kemeny's constant. Based on this bound, we develop an iterative procedure which produces a sequence of monotonically improved policies for the average-reward criterion. This iterative procedure can then be combined with classic DRL (Deep Reinforcement Learning) methods, resulting in practical DRL algorithms that target the long-run average-reward criterion. In particular, we demonstrate that Average-Reward TRPO (ATRPO), which adapts the on-policy TRPO algorithm to the average-reward criterion, significantly outperforms TRPO in the most challenging MuJoCo environments.
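The abstract refers to two quantities of the Markov chain induced by a fixed policy: the long-run average reward and Kemeny's constant. The following is only a minimal sketch of how they can be computed for a tiny finite chain, and is not the paper's algorithm. It assumes the chain is ergodic (irreducible and aperiodic). The function names, the two-state transition matrix `P`, and the reward vector `r` are illustrative assumptions that do not come from the paper. Kemeny's constant is taken here as the stationary-weighted mean first passage time with the convention that the passage time from a state to itself is 0; some texts use a convention that differs by one.

```python
# Toy sketch (assumed names and data): average reward and Kemeny's constant
# for the Markov chain induced by a fixed policy on a finite state space.

def stationary_distribution(P, iters=2000):
    """Approximate the stationary distribution d (d = d P) by power iteration."""
    n = len(P)
    d = [1.0 / n] * n
    for _ in range(iters):
        d = [sum(d[i] * P[i][j] for i in range(n)) for j in range(n)]
    return d

def average_reward(P, r):
    """Long-run average reward: sum over states of d(s) * r(s)."""
    d = stationary_distribution(P)
    return sum(d_s * r_s for d_s, r_s in zip(d, r))

def mean_first_passage_times(P, iters=2000):
    """Fixed-point iteration on m[i][j] = 1 + sum_{k != j} P[i][k] * m[k][j],
    with m[j][j] = 0 by convention."""
    n = len(P)
    m = [[0.0] * n for _ in range(n)]
    for _ in range(iters):
        m = [[0.0 if i == j else
              1.0 + sum(P[i][k] * m[k][j] for k in range(n) if k != j)
              for j in range(n)]
             for i in range(n)]
    return m

def kemeny_constant(P, start=0):
    """Stationary-weighted mean first passage time from `start`.
    For an ergodic chain the value does not depend on `start`."""
    d = stationary_distribution(P)
    m = mean_first_passage_times(P)
    return sum(d[j] * m[start][j] for j in range(len(P)))

# Hypothetical two-state chain: leave state 0 w.p. 0.2, leave state 1 w.p. 0.3.
P = [[0.8, 0.2],
     [0.3, 0.7]]
r = [1.0, 0.0]  # assumed per-state rewards under the fixed policy

print(average_reward(P, r))   # approximately 0.6
print(kemeny_constant(P, 0))  # approximately 2.0
print(kemeny_constant(P, 1))  # approximately 2.0, same for every start state
```

For this two-state chain the stationary distribution is (0.6, 0.4), so the average reward is 0.6, and Kemeny's constant equals 1 / (0.2 + 0.3) = 2 regardless of the starting state. Pure-Python iteration is used to keep the sketch dependency-free; for larger chains a linear solver would be the more usual choice.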
Pages: 11
Related papers
50 in total
  • [1] Full Gradient Deep Reinforcement Learning for Average-Reward Criterion
    Pagare, Tejas
    Borkar, Vivek
    Avrachenkov, Konstantin
    [J]. LEARNING FOR DYNAMICS AND CONTROL CONFERENCE, VOL 211, 2023, 211
  • [2] Robust Average-Reward Reinforcement Learning
    Wang, Yue
    Velasquez, Alvaro
    Atia, George
    Prater-Bennette, Ashley
    Zou, Shaofeng
    [J]. JOURNAL OF ARTIFICIAL INTELLIGENCE RESEARCH, 2024, 80 : 719 - 803
  • [3] Robust Average-Reward Reinforcement Learning
    Wang, Yue
    Velasquez, Alvaro
    Atia, George
    Prater-Bennette, Ashley
    Zou, Shaofeng
    [J]. Journal of Artificial Intelligence Research, 2024, 80 : 719 - 803
  • [4] Average-Reward Reinforcement Learning with Trust Region Methods
    Ma, Xiaoteng
    Tang, Xiaohang
    Xia, Li
    Yang, Jun
    Zhao, Qianchuan
    [J]. PROCEEDINGS OF THE THIRTIETH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, IJCAI 2021, 2021, : 2797 - 2803
  • [5] Tuning Local Search by Average-Reward Reinforcement Learning
    Prestwich, Steven
    [J]. LEARNING AND INTELLIGENT OPTIMIZATION, 2008, 5313 : 192 - 205
  • [6] Inverse Reinforcement Learning with the Average Reward Criterion
    Wu, Feiyang
    Ke, Jingyang
    Wu, Anqi
    [J]. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023,
  • [7] An Average-Reward Reinforcement Learning Algorithm based on Schweitzer's Transformation
    Li Jianjun
    Ren Jiangong
    Li Yanjie
    [J]. PROCEEDINGS OF THE 31ST CHINESE CONTROL CONFERENCE, 2012, : 2966 - 2970
  • [8] Average-Reward Learning and Planning with Options
    Wan, Yi
    Naik, Abhishek
    Sutton, Richard S.
    [J]. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 34 (NEURIPS 2021), 2021, 34
  • [9] Decentralized Multi-Agent Reinforcement Learning in Average-Reward Dynamic DCOPs
    Duc Thien Nguyen
    Yeoh, William
    Lau, Hoong Chuin
    Zilberstein, Shlomo
    Zhang, Chongjie
    [J]. PROCEEDINGS OF THE TWENTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2014, : 1447 - 1455
  • [10] An average-reward reinforcement learning algorithm for computing bias-optimal policies
    Mahadevan, S
    [J]. PROCEEDINGS OF THE THIRTEENTH NATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND THE EIGHTH INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE, VOLS 1 AND 2, 1996, : 875 - 880