Temporal-Logic-Based Reward Shaping for Continuing Reinforcement Learning Tasks

Cited by: 0
Authors
Jiang, Yuqian [1]
Bharadwaj, Suda [2]
Wu, Bo [2]
Shah, Rishi [1,3]
Topcu, Ufuk [2]
Stone, Peter [1,4]
Affiliations
[1] Univ Texas Austin, Dept Comp Sci, Austin, TX 78712 USA
[2] Univ Texas Austin, Dept Aerosp Engn & Engn Mech, Austin, TX 78712 USA
[3] Amazon, Seattle, WA USA
[4] Sony AI, Tokyo, Japan
Keywords
DOI: Not available
CLC Number: TP18 [Artificial Intelligence Theory]
Subject Classification Codes: 081104; 0812; 0835; 1405
Abstract
In continuing tasks, average-reward reinforcement learning may be a more appropriate problem formulation than the more common discounted-reward formulation. As in the discounted setting, learning an optimal policy typically requires a large amount of training experience. Reward shaping is a common approach for incorporating domain knowledge into reinforcement learning in order to speed up convergence to an optimal policy. However, to the best of our knowledge, the theoretical properties of reward shaping have thus far only been established in the discounted setting. This paper presents the first reward shaping framework for average-reward learning and proves that, under standard assumptions, the optimal policy under the original reward function can be recovered. To avoid the need for manual construction of the shaping function, we introduce a method for utilizing domain knowledge expressed as a temporal logic formula. The formula is automatically translated to a shaping function that provides additional reward throughout the learning process. We evaluate the proposed method on three continuing tasks. In all cases, shaping speeds up the average-reward learning rate without any reduction in the performance of the learned policy compared to relevant baselines.
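The abstract describes a potential-based shaping term added to the reward in the average-reward setting, with the potential derived from a temporal logic specification. The following minimal Python sketch is not the authors' implementation: the ring-shaped toy task, the hand-written potential function phi, and all hyperparameters are assumptions made purely for illustration. It pairs tabular R-learning (an average-reward variant of Q-learning) with a shaping term of the form Phi(s') - Phi(s), which telescopes along any trajectory and therefore leaves every policy's long-run average reward, and hence the optimal policy, unchanged.

# Illustrative sketch only (not the paper's implementation): potential-based
# reward shaping combined with tabular R-learning in a toy continuing task.
import random
from collections import defaultdict

# Toy continuing task: states 0..4 arranged in a ring; action 0 moves
# forward, action 1 stays.  Reward 1 is given only when the agent
# completes a loop (enters state 0 from state 4).
N_STATES, ACTIONS = 5, (0, 1)

def step(s, a):
    s_next = (s + 1) % N_STATES if a == 0 else s
    r = 1.0 if (s == N_STATES - 1 and s_next == 0) else 0.0
    return s_next, r

# Hypothetical potential: progress toward completing the loop.  In the
# paper this role is played by a function derived automatically from a
# temporal logic formula; here it is hand-written for brevity.
def phi(s):
    return s / N_STATES

def r_learning(shaped=True, steps=20000, alpha=0.1, beta=0.01, eps=0.1):
    Q = defaultdict(float)      # relative action values
    rho = 0.0                   # running estimate of the average reward
    s = 0
    for _ in range(steps):
        # epsilon-greedy action selection
        a = random.choice(ACTIONS) if random.random() < eps \
            else max(ACTIONS, key=lambda a_: Q[(s, a_)])
        s_next, r = step(s, a)
        # Potential-based shaping term F(s, s') = phi(s') - phi(s); it
        # telescopes, so it does not change any policy's average reward.
        if shaped:
            r = r + phi(s_next) - phi(s)
        greedy = Q[(s, a)] >= max(Q[(s, a_)] for a_ in ACTIONS)
        best_next = max(Q[(s_next, a_)] for a_ in ACTIONS)
        td = r - rho + best_next - Q[(s, a)]
        Q[(s, a)] += alpha * td
        if greedy:
            # update the average-reward estimate only on greedy steps
            rho += beta * td
        s = s_next
    return Q, rho

if __name__ == "__main__":
    random.seed(0)
    Q, rho = r_learning(shaped=True)
    print("estimated average reward:", round(rho, 3))

In this sketch the always-move-forward policy earns one unit of reward every five steps, so its average reward is 0.2 with or without shaping; the shaping term only densifies the per-step feedback, which is the effect the paper exploits to speed up learning.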
Pages: 7995-8003
Number of pages: 9