Learning reward machines: A study in partially observable reinforcement learning 

Cited by: 2
Authors
Icarte, Rodrigo Toro [1 ,6 ]
Klassen, Toryn Q. [3 ,4 ]
Valenzano, Richard [5 ]
Castro, Margarita P. [2 ,6 ]
Waldie, Ethan [3 ]
Mcilraith, Sheila A. [3 ,4 ]
Affiliations
[1] Pontificia Univ Catolica Chile, Dept Comp Sci, Santiago, RM, Chile
[2] Pontificia Univ Catolica Chile PUC, Dept Ind & Syst Engn, Santiago, RM, Chile
[3] Univ Toronto, Dept Comp Sci, Toronto, ON, Canada
[4] Vector Inst Artificial Intelligence, Toronto, ON, Canada
[5] Toronto Metropolitan Univ, Toronto, ON, Canada
[6] Ctr Nacl Inteligencia Artificial CENIA, Santiago, RM, Chile
Keywords
Reinforcement learning; Reward machines; Partial observability; Automata learning; Abstractions; Non-Markovian environments;
D O I
10.1016/j.artint.2023.103989
CLC Classification
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Reinforcement Learning (RL) is a machine learning paradigm wherein an artificial agent interacts with an environment with the purpose of learning behaviour that maximizes the expected cumulative reward it receives from the environment. Reward machines (RMs) provide a structured, automata-based representation of a reward function that enables an RL agent to decompose an RL problem into structured subproblems that can be efficiently learned via off-policy learning. Here we show that RMs can be learned from experience, instead of being specified by the user, and that the resulting problem decomposition can be used to effectively solve partially observable RL problems. We pose the task of learning RMs as a discrete optimization problem where the objective is to find an RM that decomposes the problem into a set of subproblems such that the combination of their optimal memoryless policies is an optimal policy for the original problem. We show the effectiveness of this approach on three partially observable domains, where it significantly outperforms A3C, PPO, and ACER, and discuss its advantages, limitations, and broader potential. © 2023 Elsevier B.V. All rights reserved.
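To make the automata-based representation concrete, the following is a minimal sketch of a reward machine for a hypothetical "get coffee, then deliver it to the office" task. The class, state names, and event labels are illustrative assumptions for this sketch, not the authors' implementation: an RM is a finite-state machine whose transitions fire on high-level propositions detected in the environment and whose outputs are rewards, so the RM state acts as external memory for a partially observable agent.

```python
# Minimal reward machine (RM) sketch: transitions are keyed by
# (RM state, observed proposition) and return (next state, reward).
# All names here are illustrative, not from the paper's code.

class RewardMachine:
    def __init__(self, initial_state, transitions):
        # transitions: {(state, proposition): (next_state, reward)}
        self.initial_state = initial_state
        self.transitions = transitions

    def step(self, state, proposition):
        """Advance the RM on one labelled event.

        Events with no listed transition self-loop with zero reward.
        """
        return self.transitions.get((state, proposition), (state, 0.0))


# RM for a hypothetical "get coffee, then deliver to the office" task:
# u0 = no coffee yet, u1 = carrying coffee, u2 = delivered (terminal).
rm = RewardMachine(
    initial_state="u0",
    transitions={
        ("u0", "coffee"): ("u1", 0.0),  # picked up coffee: remember it
        ("u1", "office"): ("u2", 1.0),  # delivered coffee: reward
    },
)

# Simulate a labelled trace coming from the environment. Note the first
# "office" event gives no reward: the RM state supplies the memory that
# the raw (partial) observation lacks.
state = rm.initial_state
total_reward = 0.0
for event in ["office", "coffee", "office"]:
    state, r = rm.step(state, event)
    total_reward += r
print(state, total_reward)  # -> u2 1.0
```

In the paper's setting each RM state also induces a subproblem (learn a policy for that state), which is what allows the off-policy decomposition described in the abstract.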
Pages: 27
Related Papers
50 records in total
  • [1] Learning Reward Machines for Partially Observable Reinforcement Learning
    Icarte, Rodrigo Toro
    Waldie, Ethan
    Klassen, Toryn Q.
    Valenzano, Richard
    Castro, Margarita P.
    McIlraith, Sheila A.
    [J]. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 32 (NIPS 2019), 2019, 32
  • [2] Inverse Reinforcement Learning in Partially Observable Environments
    Choi, Jaedeug
    Kim, Kee-Eung
    [J]. JOURNAL OF MACHINE LEARNING RESEARCH, 2011, 12 : 691 - 730
  • [3] Reinforcement Learning with Stochastic Reward Machines
    Corazza, Jan
    Gavran, Ivan
    Neider, Daniel
    [J]. THIRTY-SIXTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FOURTH CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE / THE TWELFTH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2022, : 6429 - 6436
  • [4] Blockwise Sequential Model Learning for Partially Observable Reinforcement Learning
    Park, Giseung
    Choi, Sungho
    Sung, Youngchul
    [J]. THIRTY-SIXTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FOURTH CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE / THE TWELFTH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2022, : 7941 - 7948
  • [5] Inverse Reinforcement Learning in Partially Observable Environments
    Choi, Jaedeug
    Kim, Kee-Eung
    [J]. 21ST INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE (IJCAI-09), PROCEEDINGS, 2009, : 1028 - 1033
  • [6] Partially Observable Reinforcement Learning for Sustainable Active Surveillance
    Chen, Hechang
    Yang, Bo
    Liu, Jiming
    [J]. KNOWLEDGE SCIENCE, ENGINEERING AND MANAGEMENT, KSEM 2018, PT II, 2018, 11062 : 425 - 437
  • [7] Regret Minimization for Partially Observable Deep Reinforcement Learning
    Jin, Peter
    Keutzer, Kurt
    Levine, Sergey
    [J]. INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 80, 2018, 80
  • [8] Reward Machines: Exploiting Reward Function Structure in Reinforcement Learning
    Icarte, Rodrigo Toro
    Klassen, Toryn Q.
    Valenzano, Richard
    McIlraith, Sheila A.
    [J]. JOURNAL OF ARTIFICIAL INTELLIGENCE RESEARCH, 2022, 73 : 173 - 208