Offline Reinforcement Learning with On-Policy Q-Function Regularization

Cited by: 0
Authors
Shi, Laixi [1 ]
Dadashi, Robert [2 ]
Chi, Yuejie [1 ]
Castro, Pablo Samuel [2 ]
Geist, Matthieu [2 ]
Affiliations
[1] Carnegie Mellon Univ, Pittsburgh, PA 15213 USA
[2] Google Res, Brain Team, Pittsburgh, PA USA
Funding
Andrew Mellon Foundation, USA
Keywords
offline reinforcement learning; actor-critic; SARSA
DOI
10.1007/978-3-031-43421-1_27
CLC number
TP18 [Artificial Intelligence Theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
The core challenge of offline reinforcement learning (RL) is dealing with the (potentially catastrophic) extrapolation error induced by the distribution shift between the history dataset and the desired policy. A large portion of prior work tackles this challenge by implicitly/explicitly regularizing the learning policy towards the behavior policy, which is hard to estimate reliably in practice. In this work, we propose to regularize towards the Q-function of the behavior policy instead of the behavior policy itself, under the premise that the Q-function can be estimated more reliably and easily by a SARSA-style estimate and handles the extrapolation error more straightforwardly. We propose two algorithms taking advantage of the estimated Q-function through regularizations, and demonstrate they exhibit strong performance on the D4RL benchmarks.
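The abstract's two ingredients, a SARSA-style estimate of the behavior policy's Q-function and a regularizer built from that estimate, can be illustrated in a few lines. Below is a minimal tabular sketch in Python; it is not the paper's actor-critic implementation, and the synthetic dataset, the update rules, and the regularization weight `lam` are assumptions made only for illustration.

```python
# Minimal sketch (not the paper's algorithm): SARSA-style estimation of the
# behavior policy's Q-function from logged transitions, then a toy Q-update
# regularized towards that estimate. All quantities here are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma, alpha = 5, 3, 0.99, 0.1

# Hypothetical offline dataset of (s, a, r, s', a') tuples from an unknown
# behavior policy. The SARSA-style target below only uses the logged next
# action a', so it never evaluates actions outside the dataset.
dataset = [
    (rng.integers(n_states), rng.integers(n_actions), rng.normal(),
     rng.integers(n_states), rng.integers(n_actions))
    for _ in range(10_000)
]

# Step 1: SARSA-style estimate of Q^beta, the behavior policy's Q-function.
q_beta = np.zeros((n_states, n_actions))
for _ in range(20):                       # a few passes over the offline data
    for s, a, r, s_next, a_next in dataset:
        td_target = r + gamma * q_beta[s_next, a_next]   # logged next action
        q_beta[s, a] += alpha * (td_target - q_beta[s, a])

# Step 2: learn a Q-function for the target policy, pulled towards Q^beta.
# The Bellman-optimality target is mixed with a penalty towards the SARSA
# estimate; the weight `lam` is an assumed hyperparameter.
q = np.zeros_like(q_beta)
lam = 0.5
for _ in range(20):
    for s, a, r, s_next, _ in dataset:
        td_error = (r + gamma * q[s_next].max()) - q[s, a]
        reg_error = q_beta[s, a] - q[s, a]                # regularization term
        q[s, a] += alpha * (td_error + lam * reg_error)

print("Q^beta sample row:", np.round(q_beta[0], 2))
print("regularized Q sample row:", np.round(q[0], 2))
```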
Pages: 455-471
Page count: 17
Related Papers
50 records in total
  • [1] Wang, Tao; Xie, Shaorong; Gao, Mingke; Chen, Xue; Zhang, Zhenyu; Yu, Hang. Offline Reinforcement Learning via Policy Regularization and Ensemble Q-Functions. 2022 IEEE 34th International Conference on Tools with Artificial Intelligence (ICTAI), 2022: 1167-1174.
  • [2] Langlois, Marina; Sloan, Robert H. Reinforcement learning via approximation of the Q-function. Journal of Experimental & Theoretical Artificial Intelligence, 2010, 22(3): 219-235.
  • [3] Banerjee, B.; Sen, S.; Peng, J. On-policy concurrent reinforcement learning. Journal of Experimental & Theoretical Artificial Intelligence, 2004, 16(4): 245-260.
  • [4] Zhang, Longfei; Zhang, Yulong; Liu, Shixuan; Chen, Li; Liang, Xingxing; Cheng, Guangquan; Liu, Zhong. ORAD: a new framework of offline Reinforcement Learning with Q-value regularization. Evolutionary Intelligence, 2024, 17(1): 339-347.
  • [5] Gorji, Saeed Rahimi; Granmo, Ole-Christoffer. Off-policy and on-policy reinforcement learning with the Tsetlin machine. Applied Intelligence, 2023, 53(8): 8596-8613.
  • [6] Abramson, M.; Wechsler, H. Tabu search exploration for on-policy reinforcement learning. Proceedings of the International Joint Conference on Neural Networks 2003, Vols 1-4, 2003: 2910-2915.
  • [7] Huang, Longyang; Dong, Botao; Xie, Wei; Zhang, Weidong. Offline Reinforcement Learning With Behavior Value Regularization. IEEE Transactions on Cybernetics, 2024, 54(6): 3692-3704.
  • [8] Mao, Yixiu; Zhang, Hongchang; Chen, Chen; Xu, Yi; Ji, Xiangyang. Supported Value Regularization for Offline Reinforcement Learning. Advances in Neural Information Processing Systems 36 (NeurIPS 2023), 2023.