Constrained Policy Optimization with Explicit Behavior Density for Offline Reinforcement Learning

Times cited: 0

Authors:

Zhang, Jing [1]

Zhang, Chi [2]

Wang, Wenjia [1, 3]

Jing, Bing-Yi [4]

Affiliations:

[1] HKUST, Hong Kong, Peoples R China

[2] Kuaishou Technol, Beijing, Peoples R China

[3] HKUST GZ, Guangzhou, Peoples R China

[4] SUSTech, Shenzhen, Peoples R China

Source:

ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023) | 2023

Keywords:

MAXIMUM-LIKELIHOOD-ESTIMATION

DOI:

Not available

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

Because they cannot interact with the environment, offline reinforcement learning (RL) methods face the challenge of estimating values at out-of-distribution (OOD) points. Existing methods address this issue either by constraining the policy to exclude OOD actions or by making the Q-function pessimistic. However, these methods can be overly conservative or fail to identify OOD regions accurately. To overcome this problem, we propose Constrained Policy Optimization with Explicit Behavior Density (CPED), a method that uses a flow-GAN model to explicitly estimate the density of the behavior policy. With this explicit density estimate, CPED can accurately identify the safe region and optimize within it, yielding less conservative learned policies. We further provide theoretical results for the flow-GAN estimator and a performance guarantee for CPED, showing that CPED can find the optimal Q-function value. Empirically, CPED outperforms existing alternatives on various standard offline RL tasks, yielding higher expected returns.
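The abstract states the core idea only at a high level: once the behavior policy's density can be evaluated explicitly, the learned policy can be restricted to the region where that density is high enough. The following is a minimal sketch of that idea for a single state with a 1-D action, under loudly stated assumptions. A Gaussian kernel density estimate stands in for the flow-GAN (a substituted technique, not the authors' estimator), and the policy step is reduced to picking the highest-Q candidate whose estimated log-density clears a threshold. The function names, the threshold, the toy Q-function, and the logged actions are all hypothetical.

```python
import math
from typing import Callable, Sequence


def gaussian_kde_log_density(
    samples: Sequence[float], bandwidth: float
) -> Callable[[float], float]:
    """Return a 1-D Gaussian kernel density estimate as a log-density function.

    Stand-in (assumption) for the flow-GAN behavior-density model: both give an
    explicit, evaluable density, which is the property this sketch relies on.
    """
    n = len(samples)
    # log of the normalizing constant 1 / (n * h * sqrt(2*pi))
    log_norm = -math.log(n * bandwidth * math.sqrt(2.0 * math.pi))

    def log_density(a: float) -> float:
        # Per-sample kernel exponents, combined with log-sum-exp for stability.
        exps = [-0.5 * ((a - x) / bandwidth) ** 2 for x in samples]
        m = max(exps)
        return log_norm + m + math.log(sum(math.exp(e - m) for e in exps))

    return log_density


def constrained_greedy_action(
    candidates: Sequence[float],
    q_fn: Callable[[float], float],
    log_density: Callable[[float], float],
    log_threshold: float,
) -> float:
    """Return the highest-Q candidate inside the estimated safe region.

    The safe region is the set of candidates whose estimated behavior
    log-density is at least log_threshold (a hypothetical parameter).
    """
    safe = [a for a in candidates if log_density(a) >= log_threshold]
    if not safe:
        # No candidate is in-distribution enough: fall back to the most likely one.
        return max(candidates, key=log_density)
    return max(safe, key=q_fn)


if __name__ == "__main__":
    # Hypothetical logged actions for one state, clustered around 0.2.
    behavior_actions = [0.10, 0.15, 0.20, 0.25, 0.30]
    log_p = gaussian_kde_log_density(behavior_actions, bandwidth=0.05)

    def q_hat(a: float) -> float:
        # Hypothetical learned Q that keeps rising outside the data,
        # i.e. it over-estimates OOD actions.
        return 2.0 * a

    grid = [round(-1.0 + 0.05 * i, 2) for i in range(41)]  # -1.0 .. 1.0
    print("unconstrained:", max(grid, key=q_hat))
    print("constrained:  ", constrained_greedy_action(grid, q_hat, log_p, math.log(0.5)))
```

Running it, the unconstrained maximization selects 1.0, far outside the logged actions, because the toy Q-function keeps extrapolating upward; the constrained choice is 0.35, the largest grid action whose estimated density clears the threshold. The abstract describes optimizing a policy rather than searching a grid, so the grid search and the KDE here are simplifications meant only to show the role an explicit density plays.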


Pages: 15
