A Perspective of Q-value Estimation on Offline-to-Online Reinforcement Learning

Cited by: 0

Authors:

Zhang, Yinmin [1,2]

Liu, Jie [2,3]

Li, Chuming [1,2]

Niu, Yazhe [2]

Yang, Yaodong [4]

Liu, Yu [2]

Ouyang, Wanli [2]

Affiliations:

[1] Univ Sydney, SenseTime Comp Vis Grp, Sydney, NSW, Australia

[2] Shanghai Artificial Intelligence Lab, Shanghai, Peoples R China

[3] Chinese Univ Hong Kong, Multimedia Lab, Hong Kong, Peoples R China

[4] Peking Univ, Inst AI, Beijing, Peoples R China

Source:

THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 15 | 2024

Funding:

National Key R&D Program of China; National Natural Science Foundation of China

Keywords:

DOI:

Not available

Chinese Library Classification number:

TP18 [Artificial intelligence theory]

Subject classification codes:

081104; 0812; 0835; 1405

Abstract:

Offline-to-online reinforcement learning (O2O RL) aims to improve the performance of an offline-pretrained policy using only a few online samples. Built on offline RL algorithms, most O2O methods focus on the balance between the RL objective and pessimism, or on how offline and online samples are used. In this paper, we systematically study the challenges that remain in O2O RL from a new perspective and identify inaccurate Q-value estimation inherited from offline pretraining as the cause of both slow performance improvement and unstable online finetuning. Specifically, we demonstrate that estimation bias and inaccurate ranking of Q-values send a misleading signal to the policy update, making standard offline RL algorithms such as CQL and TD3-BC ineffective during online finetuning. Based on this observation, we address the Q-value estimation problem with two techniques: (1) perturbed value update and (2) increased frequency of Q-value updates. The first smooths out biased Q-value estimates with sharp peaks, preventing early-stage policy exploitation of sub-optimal actions. The second alleviates the estimation bias inherited from offline pretraining by accelerating learning. Extensive experiments on the MuJoCo and Adroit environments demonstrate that the proposed method, named SO2, significantly alleviates Q-value estimation issues and consistently improves performance over state-of-the-art methods by up to 83.1%.
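The abstract names the two techniques without giving their exact form, so the following is only a minimal sketch of one plausible reading. It assumes that the perturbed value update evaluates the target Q-function at a next action perturbed by clipped Gaussian noise, and that increased update frequency means running several critic updates per online step. The function names, parameters, and default values (`perturbed_td_target`, `online_update`, `sigma`, `noise_clip`, `critic_updates_per_step`) are hypothetical and are not taken from the paper.

```python
import numpy as np


def perturbed_td_target(q_target, policy, reward, next_state, done,
                        gamma=0.99, sigma=0.2, noise_clip=0.5,
                        act_limit=1.0, rng=None):
    """TD target evaluated at a noise-perturbed next action (assumed form).

    q_target(state, action) -> array of shape (N,)
    policy(state)           -> array of shape (N, action_dim)
    """
    rng = np.random.default_rng() if rng is None else rng
    next_action = policy(next_state)
    # Clipped Gaussian noise spreads the target over nearby actions,
    # which flattens sharp peaks in the Q-value estimate.
    noise = np.clip(rng.normal(0.0, sigma, size=next_action.shape),
                    -noise_clip, noise_clip)
    perturbed = np.clip(next_action + noise, -act_limit, act_limit)
    return reward + gamma * (1.0 - done) * q_target(next_state, perturbed)


def online_update(update_critic, update_actor, sample_batch,
                  critic_updates_per_step=10):
    """Run several critic updates, then one actor update (assumed schedule)."""
    for _ in range(critic_updates_per_step):
        update_critic(sample_batch())
    update_actor(sample_batch())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 1000
    states = np.zeros((n, 3))
    # Toy critic with a sharp peak at action 0, and a policy that sits on it.
    peaked_q = lambda s, a: np.exp(-(a ** 2).sum(axis=-1) / 0.01)
    greedy_policy = lambda s: np.zeros((s.shape[0], 1))

    plain = perturbed_td_target(peaked_q, greedy_policy, np.zeros(n), states,
                                np.zeros(n), sigma=0.0, rng=rng)
    smoothed = perturbed_td_target(peaked_q, greedy_policy, np.zeros(n),
                                   states, np.zeros(n), sigma=0.2, rng=rng)
    print("mean target without noise:", plain.mean())
    print("mean target with noise:   ", smoothed.mean())
```

In this toy setup the noisy target averages over actions around the peak, so its mean is lower than the unperturbed target. That is the smoothing effect the abstract attributes to the first technique, though the actual SO2 update rule may differ in detail.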


Pages: 16908-16916

Page count: 9
