Temporal-Logic-Based Reward Shaping for Continuing Reinforcement Learning Tasks

Cited by: 0
Authors
Jiang, Yuqian [1 ]
Bharadwaj, Suda [2 ]
Wu, Bo [2 ]
Shah, Rishi [1 ,3 ]
Topcu, Ufuk [2 ]
Stone, Peter [1 ,4 ]
Affiliations
[1] Univ Texas Austin, Dept Comp Sci, Austin, TX 78712 USA
[2] Univ Texas Austin, Dept Aerosp Engn & Engn Mech, Austin, TX 78712 USA
[3] Amazon, Seattle, WA USA
[4] Sony AI, Tokyo, Japan
Keywords
DOI
Not available
CLC Number
TP18 [Theory of Artificial Intelligence];
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
In continuing tasks, average-reward reinforcement learning may be a more appropriate problem formulation than the more common discounted reward formulation. As in the discounted setting, learning an optimal policy in this formulation typically requires a large amount of training experience. Reward shaping is a common approach for incorporating domain knowledge into reinforcement learning in order to speed up convergence to an optimal policy. However, to the best of our knowledge, the theoretical properties of reward shaping have thus far only been established in the discounted setting. This paper presents the first reward shaping framework for average-reward learning and proves that, under standard assumptions, the optimal policy under the original reward function can be recovered. To avoid the need for manual construction of the shaping function, we introduce a method for utilizing domain knowledge expressed as a temporal logic formula. The formula is automatically translated to a shaping function that provides additional reward throughout the learning process. We evaluate the proposed method on three continuing tasks. In all cases, shaping speeds up average-reward learning without any reduction in the performance of the learned policy compared to relevant baselines.
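To make the shaping mechanism in the abstract concrete, the following is a minimal, hypothetical Python sketch (not the authors' implementation): tabular average-reward (differential) Q-learning on a toy continuing patrol-a-ring task, where the reward is shaped by potential differences, r'(s, a, s') = r(s, a, s') + Phi(s') - Phi(s). The toy environment, the potential function phi, and all hyperparameters are illustrative assumptions; in the paper's framework the potential would instead be derived automatically from progress through an automaton built from the temporal logic formula.

# Minimal sketch, not the authors' code: potential-based reward shaping
# combined with tabular average-reward (differential) Q-learning on a
# toy continuing task. Environment and hyperparameters are assumptions.
import random
from collections import defaultdict

N = 10  # ring of states 0..N-1; completing a forward loop pays reward 1

def step(s, a):
    """Toy continuing task: a=1 moves forward around the ring, a=0 backward."""
    s_next = (s + 1) % N if a == 1 else (s - 1) % N
    r = 1.0 if (s == N - 1 and a == 1) else 0.0  # loop completed
    return s_next, r

def phi(s):
    """Hypothetical potential: fraction of the loop completed. In the paper,
    this role is played by a function derived from the temporal logic formula."""
    return s / N

def differential_q(num_steps=200_000, alpha=0.1, eta=0.01, eps=0.1, shaping=True):
    q = defaultdict(float)  # tabular action-value estimates
    rho = 0.0               # running estimate of the average reward (gain)
    s = 0
    for _ in range(num_steps):
        # epsilon-greedy action selection over the two actions
        if random.random() < eps:
            a = random.choice((0, 1))
        else:
            a = max((0, 1), key=lambda act: q[(s, act)])
        s_next, r = step(s, a)
        if shaping:
            r += phi(s_next) - phi(s)  # potential-based shaping term
        # average-reward TD error and differential Q-learning updates
        delta = r - rho + max(q[(s_next, 0)], q[(s_next, 1)]) - q[(s, a)]
        q[(s, a)] += alpha * delta
        rho += eta * alpha * delta
        s = s_next
    return q, rho

q, rho = differential_q()
# Shaping terms telescope to zero over any cycle, so rho still estimates
# the unshaped gain, about 1/N under the optimal always-forward policy.
print(f"estimated average reward: {rho:.3f}")

Because the potential differences telescope and contribute zero reward on average over any cycle, the shaped task shares its gain-optimal policy with the original one, which is the preservation property the paper establishes for the average-reward setting.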
Pages: 7995-8003
Number of pages: 9