A Perspective of Q-value Estimation on Offline-to-Online Reinforcement Learning

Cited by: 0
Authors
Zhang, Yinmin [1 ,2 ]
Liu, Jie [2 ,3 ]
Li, Chuming [1 ,2 ]
Niu, Yazhe [2 ]
Yang, Yaodong [4 ]
Liu, Yu [2 ]
Ouyang, Wanli [2 ]
Affiliations
[1] Univ Sydney, SenseTime Comp Vis Grp, Sydney, NSW, Australia
[2] Shanghai Artificial Intelligence Lab, Shanghai, Peoples R China
[3] Chinese Univ Hong Kong, Multimedia Lab, Hong Kong, Peoples R China
[4] Peking Univ, Inst AI, Beijing, Peoples R China
Funding
National Key R&D Program of China; National Natural Science Foundation of China;
Keywords
DOI
Not available
CLC Number
TP18 [Theory of Artificial Intelligence];
Subject Classification Codes
081104; 0812; 0835; 1405;
Abstract
Offline-to-online Reinforcement Learning (O2O RL) aims to improve the performance of an offline pretrained policy using only a few online samples. Built on offline RL algorithms, most O2O methods focus on the balance between the RL objective and pessimism, or on the utilization of offline and online samples. In this paper, we take a novel perspective and systematically study the remaining challenges of O2O RL, identifying that the slow performance improvement and the instability of online finetuning stem from the inaccurate Q-value estimation inherited from offline pretraining. Specifically, we demonstrate that estimation bias and inaccurate ranking of Q-values produce misleading signals for the policy update, making standard offline RL algorithms, such as CQL and TD3-BC, ineffective during online finetuning. Based on this observation, we address the Q-value estimation problem with two techniques: (1) a perturbed value update and (2) an increased frequency of Q-value updates. The first technique smooths out biased Q-value estimates with sharp peaks, preventing the policy from exploiting sub-optimal actions in the early stage. The second alleviates the estimation bias inherited from offline pretraining by accelerating learning. Extensive experiments on the MuJoCo and Adroit environments demonstrate that the proposed method, named SO2, significantly alleviates the Q-value estimation issues and consistently improves performance over state-of-the-art methods by up to 83.1%.
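The two techniques described in the abstract map naturally onto a standard actor-critic update loop. The following is a minimal sketch, not the authors' released implementation, assuming a TD3-style agent with twin critics in PyTorch; names such as finetune_critics, utd_ratio, noise_std, and noise_clip are illustrative. Technique (1) is realized by adding clipped noise to the target action before computing the TD target, and technique (2) by repeating the critic update several times per sampled batch.

# Minimal sketch (not the authors' code) of the two fixes named in the
# abstract, assuming a TD3-style agent with twin critics in PyTorch.
import torch
import torch.nn as nn
import torch.nn.functional as F


def soft_update(target: nn.Module, source: nn.Module, tau: float = 0.005):
    """Polyak-average the target network toward the online network."""
    for tp, sp in zip(target.parameters(), source.parameters()):
        tp.data.mul_(1.0 - tau).add_(tau * sp.data)


def finetune_critics(batch, actor_target, critics, critic_targets, critic_opt,
                     gamma=0.99, noise_std=0.2, noise_clip=0.5, utd_ratio=5):
    """One online-finetuning step: (1) perturbed value update,
    (2) several Q-value updates per sampled batch (higher update frequency)."""
    obs, act, rew, next_obs, done = batch
    critic1, critic2 = critics
    critic1_t, critic2_t = critic_targets

    for _ in range(utd_ratio):  # technique (2): more frequent Q-value updates
        with torch.no_grad():
            # technique (1): perturb the target action so the TD target is
            # averaged over a neighbourhood, smoothing sharp Q-value peaks
            noise = (torch.randn_like(act) * noise_std).clamp(-noise_clip, noise_clip)
            next_act = (actor_target(next_obs) + noise).clamp(-1.0, 1.0)
            target_q = torch.min(critic1_t(next_obs, next_act),
                                 critic2_t(next_obs, next_act))
            td_target = rew + gamma * (1.0 - done) * target_q

        critic_loss = (F.mse_loss(critic1(obs, act), td_target) +
                       F.mse_loss(critic2(obs, act), td_target))
        critic_opt.zero_grad()
        critic_loss.backward()
        critic_opt.step()

        soft_update(critic1_t, critic1)
        soft_update(critic2_t, critic2)

In this sketch the perturbation noise bounds and the number of updates per batch are hyperparameters assumed for illustration; the actor update and replay-buffer handling are omitted to keep the focus on the Q-value estimation side.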
Pages: 16908-16916
Number of pages: 9