Offline Reinforcement Learning with On-Policy Q-Function Regularization

Cited by: 0
Authors
Shi, Laixi [1 ]
Dadashi, Robert [2 ]
Chi, Yuejie [1 ]
Castro, Pablo Samuel [2 ]
Geist, Matthieu [2 ]
Affiliations
[1] Carnegie Mellon Univ, Pittsburgh, PA 15213 USA
[2] Google Res, Brain Team, Pittsburgh, PA USA
Funding
Andrew Mellon Foundation, USA
Keywords
offline reinforcement learning; actor-critic; SARSA
DOI
10.1007/978-3-031-43421-1_27
CLC number
TP18 [Artificial Intelligence Theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
The core challenge of offline reinforcement learning (RL) is dealing with the (potentially catastrophic) extrapolation error induced by the distribution shift between the history dataset and the desired policy. A large portion of prior work tackles this challenge by implicitly/explicitly regularizing the learning policy towards the behavior policy, which is hard to estimate reliably in practice. In this work, we propose to regularize towards the Q-function of the behavior policy instead of the behavior policy itself, under the premise that the Q-function can be estimated more reliably and easily by a SARSA-style estimate and handles the extrapolation error more straightforwardly. We propose two algorithms taking advantage of the estimated Q-function through regularizations, and demonstrate they exhibit strong performance on the D4RL benchmarks.
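The abstract's two ingredients, a SARSA-style estimate of the behavior policy's Q-function and a regularizer built from that estimate, can be illustrated in a few lines. Below is a minimal tabular sketch in Python; it is not the paper's actor-critic implementation, and the synthetic dataset, the update rules, and the regularization weight `lam` are assumptions made only for illustration.

```python
# Minimal sketch (not the paper's algorithm): SARSA-style estimation of the
# behavior policy's Q-function from logged transitions, then a toy Q-update
# regularized towards that estimate. All quantities here are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma, alpha = 5, 3, 0.99, 0.1

# Hypothetical offline dataset of (s, a, r, s', a') tuples from an unknown
# behavior policy. The SARSA-style target below only uses the logged next
# action a', so it never evaluates actions outside the dataset.
dataset = [
    (rng.integers(n_states), rng.integers(n_actions), rng.normal(),
     rng.integers(n_states), rng.integers(n_actions))
    for _ in range(10_000)
]

# Step 1: SARSA-style estimate of Q^beta, the behavior policy's Q-function.
q_beta = np.zeros((n_states, n_actions))
for _ in range(20):                       # a few passes over the offline data
    for s, a, r, s_next, a_next in dataset:
        td_target = r + gamma * q_beta[s_next, a_next]   # logged next action
        q_beta[s, a] += alpha * (td_target - q_beta[s, a])

# Step 2: learn a Q-function for the target policy, pulled towards Q^beta.
# The Bellman-optimality target is mixed with a penalty towards the SARSA
# estimate; the weight `lam` is an assumed hyperparameter.
q = np.zeros_like(q_beta)
lam = 0.5
for _ in range(20):
    for s, a, r, s_next, _ in dataset:
        td_error = (r + gamma * q[s_next].max()) - q[s, a]
        reg_error = q_beta[s, a] - q[s, a]                # regularization term
        q[s, a] += alpha * (td_error + lam * reg_error)

print("Q^beta sample row:", np.round(q_beta[0], 2))
print("regularized Q sample row:", np.round(q[0], 2))
```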
Pages: 455-471
Page count: 17
Related Papers
50 records in total
  • [1] Wang, Tao; Xie, Shaorong; Gao, Mingke; Chen, Xue; Zhang, Zhenyu; Yu, Hang. Offline Reinforcement Learning via Policy Regularization and Ensemble Q-Functions. 2022 IEEE 34th International Conference on Tools with Artificial Intelligence (ICTAI), 2022: 1167-1174.
  • [2] Langlois, Marina; Sloan, Robert H. Reinforcement learning via approximation of the Q-function. Journal of Experimental & Theoretical Artificial Intelligence, 2010, 22(3): 219-235.
  • [3] Banerjee, B.; Sen, S.; Peng, J. On-policy concurrent reinforcement learning. Journal of Experimental & Theoretical Artificial Intelligence, 2004, 16(4): 245-260.
  • [4] Zhang, Longfei; Zhang, Yulong; Liu, Shixuan; Chen, Li; Liang, Xingxing; Cheng, Guangquan; Liu, Zhong. ORAD: a new framework of offline Reinforcement Learning with Q-value regularization. Evolutionary Intelligence, 2024, 17(1): 339-347.
  • [5] Gorji, Saeed Rahimi; Granmo, Ole-Christoffer. Off-policy and on-policy reinforcement learning with the Tsetlin machine. Applied Intelligence, 2023, 53(8): 8596-8613.
  • [6] Abramson, M.; Wechsler, H. Tabu search exploration for on-policy reinforcement learning. Proceedings of the International Joint Conference on Neural Networks 2003, Vols 1-4, 2003: 2910-2915.
  • [7] Huang, Longyang; Dong, Botao; Xie, Wei; Zhang, Weidong. Offline Reinforcement Learning With Behavior Value Regularization. IEEE Transactions on Cybernetics, 2024, 54(6): 3692-3704.
  • [8] Mao, Yixiu; Zhang, Hongchang; Chen, Chen; Xu, Yi; Ji, Xiangyang. Supported Value Regularization for Offline Reinforcement Learning. Advances in Neural Information Processing Systems 36 (NeurIPS 2023), 2023.