A Perspective of Q-value Estimation on Offline-to-Online Reinforcement Learning

Cited: 0
Authors
Zhang, Yinmin [1, 2]
Liu, Jie [2, 3]
Li, Chuming [1, 2]
Niu, Yazhe [2]
Yang, Yaodong [4]
Liu, Yu [2]
Ouyang, Wanli [2]
Affiliations
[1] Univ Sydney, SenseTime Comp Vis Grp, Sydney, NSW, Australia
[2] Shanghai Artificial Intelligence Lab, Shanghai, Peoples R China
[3] Chinese Univ Hong Kong, Multimedia Lab, Hong Kong, Peoples R China
[4] Peking Univ, Inst AI, Beijing, Peoples R China
Funding
National Natural Science Foundation of China; National Key R&D Program of China;
Keywords
DOI
Not available
CLC number
TP18 [Artificial Intelligence Theory];
Discipline codes
081104; 0812; 0835; 1405;
Abstract
Offline-to-online Reinforcement Learning (O2O RL) aims to improve the performance of an offline pretrained policy using only a few online samples. Built on offline RL algorithms, most O2O methods focus on balancing the RL objective with pessimism, or on how offline and online samples are utilized. In this paper, we take a novel perspective and systematically study the challenges that remain in O2O RL, identifying that the slow performance improvement and the instability of online fine-tuning stem from inaccurate Q-value estimation inherited from offline pretraining. Specifically, we demonstrate that the estimation bias and the inaccurate rank ordering of Q-values produce a misleading signal for the policy update, making standard offline RL algorithms, such as CQL and TD3-BC, ineffective during online fine-tuning. Based on this observation, we address the Q-value estimation problem with two techniques: (1) perturbed value updates and (2) an increased frequency of Q-value updates. The first technique smooths out biased Q-value estimates with sharp peaks, preventing the policy from exploiting sub-optimal actions early in fine-tuning. The second alleviates the estimation bias inherited from offline pretraining by accelerating learning. Extensive experiments on the MuJoCo and Adroit environments demonstrate that the proposed method, named SO2, significantly alleviates Q-value estimation issues and consistently improves performance over state-of-the-art methods by up to 83.1%.
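The two techniques described in the abstract map naturally onto a TD3-style critic update. The following is a minimal sketch, assuming twin target critics and a replay buffer exposing a `sample()` method; the names `perturbed_target`, `utd_ratio`, `perturb_std`, and `perturb_clip` are illustrative assumptions, not the paper's actual implementation or hyperparameter names.

```python
import torch

def perturbed_target(q1_target, q2_target, policy_target, next_obs,
                     rewards, dones, gamma=0.99,
                     perturb_std=0.2, perturb_clip=0.5):
    # Perturbed value update (sketch): add clipped Gaussian noise to the
    # target action so sharp, biased peaks in the pretrained Q-estimate
    # are smoothed before bootstrapping.
    with torch.no_grad():
        next_action = policy_target(next_obs)
        noise = (torch.randn_like(next_action) * perturb_std).clamp(
            -perturb_clip, perturb_clip)
        next_action = (next_action + noise).clamp(-1.0, 1.0)
        target_q = torch.min(q1_target(next_obs, next_action),
                             q2_target(next_obs, next_action))
        return rewards + gamma * (1.0 - dones) * target_q

def critic_updates(buffer, critics, critic_targets, policy_target,
                   critic_opt, utd_ratio=10):
    # Increased Q-value update frequency (sketch): run several critic
    # updates per environment step so the estimation bias inherited from
    # offline pretraining is corrected faster during online fine-tuning.
    for _ in range(utd_ratio):
        obs, actions, rewards, next_obs, dones = buffer.sample()
        target = perturbed_target(critic_targets[0], critic_targets[1],
                                  policy_target, next_obs, rewards, dones)
        loss = sum(torch.nn.functional.mse_loss(q(obs, actions), target)
                   for q in critics)
        critic_opt.zero_grad()
        loss.backward()
        critic_opt.step()
```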
Pages: 16908-16916
Page count: 9