Average Reward Optimization with Multiple Discounting Reinforcement Learners

Cited by: 2

Authors:

Reinke, Chris [1]

Uchibe, Eiji [1,2]

Doya, Kenji [1]

Affiliations:

[1] Okinawa Inst Sci & Technol, Onna, Okinawa 9040495, Japan

[2] ATR Computat Neurosci Labs, Kyoto 6190288, Japan

Source:

NEURAL INFORMATION PROCESSING, ICONIP 2017, PT I | 2017, Vol. 10634

Keywords:

Reinforcement learning; Average reward; Model-free; Value-based; Q-learning; Modular;

DOI:

10.1007/978-3-319-70087-8_81

Chinese Library Classification:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Maximization of average reward is a major goal in reinforcement learning. Existing model-free, value-based algorithms such as R-Learning use average adjusted values. We propose a different framework, the Average Reward Independent Gamma Ensemble (AR-IGE). It is based on an ensemble of discounting Q-learning modules with a different discount factor for each module. Existing algorithms only learn the optimal policy and its average reward. In contrast, the AR-IGE learns different policies and their resulting average rewards. We prove the optimality of the AR-IGE in episodic and deterministic problems where rewards are given at several goal states. Furthermore, we show that the AR-IGE outperforms existing algorithms in such problems, especially in situations where policies have to be changed due to changes in the task. The AR-IGE represents a new way to optimize average reward that could lead to further improvements in the field.
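The ensemble idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the corridor task, its goal rewards, the three discount factors, and all function names are hypothetical, and two simplifications are assumed. Each module applies the Q-learning update with step size 1 over systematic sweeps of all state-action pairs instead of sampled exploration, and a policy's average reward is measured as episode reward divided by episode length.

```python
# Hypothetical toy task: a deterministic corridor with a goal at each end.
N_STATES = 8
START = 2
GOALS = {0: 1.0, 7: 6.0}    # goal state -> reward on arrival (assumed values)
ACTIONS = (-1, +1)          # move left / move right
GAMMAS = (0.3, 0.6, 0.9)    # one discount factor per module (assumed values)


def step(state, action):
    """Deterministic transition: returns (next_state, reward, done)."""
    nxt = min(max(state + action, 0), N_STATES - 1)
    return nxt, GOALS.get(nxt, 0.0), nxt in GOALS


def learn_q(gamma, sweeps=50):
    """Q-learning update with step size 1, swept over all state-action pairs."""
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(sweeps):
        for s in range(N_STATES):
            if s in GOALS:
                continue  # goal states are terminal and keep value 0
            for a in ACTIONS:
                nxt, r, done = step(s, a)
                bootstrap = 0.0 if done else max(q[(nxt, b)] for b in ACTIONS)
                q[(s, a)] = r + gamma * bootstrap
    return q


def average_reward(q, max_steps=100):
    """Roll out the greedy policy from START and return reward per step."""
    s, total, steps, done = START, 0.0, 0, False
    while not done and steps < max_steps:
        a = max(ACTIONS, key=lambda b: q[(s, b)])
        s, r, done = step(s, a)
        total += r
        steps += 1
    return total / steps


def run_ensemble():
    """Train one module per discount factor and pick the best average reward."""
    results = {g: average_reward(learn_q(g)) for g in GAMMAS}
    best = max(results, key=results.get)
    return results, best


if __name__ == "__main__":
    results, best = run_ensemble()
    for g, rho in results.items():
        print(f"gamma={g}: average reward per step = {rho}")
    print("selected module:", best)
```

In this toy setup the module with the smallest discount factor settles for the nearby low-reward goal, while the modules with larger discount factors reach the distant high-reward goal and obtain a higher reward per step. The ensemble therefore holds several policies together with their average rewards, and selects among them.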

Pages: 789-800

Page count: 12
