On-Policy Deep Reinforcement Learning for the Average-Reward Criterion

被引：0

作者：

Zhang, Yiming ^{[1
]}

Ross, Keith W. ^{[2
]}

机构：

[1] NYU, New York, NY 10003 USA

[2] New York Univ Shanghai, Shanghai, Peoples R China

来源：

INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 139 | 2021年 / 139卷

关键词：

MARKOV DECISION-PROCESSES; ALGORITHMS; GO;

D O I：

暂无

中图分类号：

TP18 [人工智能理论];

学科分类号：

081104 ; 0812 ; 0835 ; 1405 ;

摘要：

We develop theory and algorithms for average-reward on-policy Reinforcement Learning (RL). We first consider bounding the difference of the long-term average reward for two policies. We show that previous work based on the discounted return (Schulman et al., 2015; Achiam et al., 2017) results in a non-meaningful bound in the average-reward setting. By addressing the average-reward criterion directly, we then derive a novel bound which depends on the average divergence between the two policies and Kemeny's constant. Based on this bound, we develop an iterative procedure which produces a sequence of monotonically improved policies for the average reward criterion. This iterative procedure can then be combined with classic DRL (Deep Reinforcement Learning) methods, resulting in practical DRL algorithms that target the long-run average reward criterion. In particular, we demonstrate that Average-Reward TRPO (ATRPO), which adapts the on-policy TRPO algorithm to the average-reward criterion, significantly outperforms TRPO in the most challenging MuJuCo environments.

引用

页数：11

共 50 条

[1] Full Gradient Deep Reinforcement Learning for Average-Reward Criterion
Pagare, Tejas
Borkar, Vivek
Avrachenkov, Konstantin
[J]. LEARNING FOR DYNAMICS AND CONTROL CONFERENCE, VOL 211, 2023, 211
[2] Robust Average-Reward Reinforcement Learning
Wang, Yue
Velasquez, Alvaro
Atia, George
Prater-Bennette, Ashley
Zou, Shaofeng
[J]. JOURNAL OF ARTIFICIAL INTELLIGENCE RESEARCH, 2024, 80 : 719 - 803
[3] Robust Average-Reward Reinforcement Learning
Wang, Yue
Velasquez, Alvaro
Atia, George
Prater-Bennette, Ashley
Zou, Shaofeng
[J]. Journal of Artificial Intelligence Research, 2024, 80 : 719 - 803
[4] Average-Reward Reinforcement Learning with Trust Region Methods
Ma, Xiaoteng
Tang, Xiaohang
Xia, Li
Yang, Jun
Zhao, Qianchuan
[J]. PROCEEDINGS OF THE THIRTIETH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, IJCAI 2021, 2021, : 2797 - 2803
[5] Tuning Local Search by Average-Reward Reinforcement Learning
Prestwich, Steven
[J]. LEARNING AND INTELLIGENT OPTIMIZATION, 2008, 5313 : 192 - 205
[6] Inverse Reinforcement Learning with the Average Reward Criterion
Wu, Feiyang
Ke, Jingyang
Wu, Anqi
[J]. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023,
[7] An Average-Reward Reinforcement Learning Algorithm based on Schweitzer's Transformation
Li Jianjun
Ren Jiangong
Li Yanjie
[J]. PROCEEDINGS OF THE 31ST CHINESE CONTROL CONFERENCE, 2012, : 2966 - 2970
[8] Average-Reward Learning and Planning with Options
Wan, Yi
Naik, Abhishek
Sutton, Richard S.
[J]. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 34 (NEURIPS 2021), 2021, 34
[9] Decentralized Multi-Agent Reinforcement Learning in Average-Reward Dynamic DCOPs
Duc Thien Nguyen
Yeoh, William
Lau, Hoong Chuin
Zilberstein, Shlomo
Zhang, Chongjie
[J]. PROCEEDINGS OF THE TWENTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2014, : 1447 - 1455
[10] An average-reward reinforcement learning algorithm for computing bias-optimal policies
Mahadevan, S
[J]. PROCEEDINGS OF THE THIRTEENTH NATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND THE EIGHTH INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE, VOLS 1 AND 2, 1996, : 875 - 880

← 1 2 3 4 5 →