A DATASET PERSPECTIVE ON OFFLINE REINFORCEMENT LEARNING

Cited: 0
Authors
Schweighofer, Kajetan [1 ,2 ]
Radler, Andreas [1 ,2 ]
Dinu, Marius-Constantin [1 ,2 ,4 ]
Hofmarcher, Markus [1 ,2 ]
Patil, Vihang [1 ,2 ]
Bitto-Nemling, Angela [1 ,2 ,3 ]
Eghbal-zadeh, Hamid [1 ,2 ,3 ]
Hochreiter, Sepp [1 ,2 ]
Affiliations
[1] Johannes Kepler Univ Linz, ELLIS Unit Linz, Inst Machine Learning, Linz, Austria
[2] Johannes Kepler Univ Linz, Inst Machine Learning, LIT AI Lab, Linz, Austria
[3] IARAI, Vienna, Austria
[4] Dynatrace Res, Linz, Austria
Funding
EU Horizon 2020;
Keywords
CONCEPT DRIFT;
DOI
Not available
CLC classification
TP18 [Artificial Intelligence Theory];
Discipline classification codes
081104; 0812; 0835; 1405;
Abstract
Applying Reinforcement Learning (RL) in real-world environments can be expensive or risky due to sub-optimal policies during training. In Offline RL, this problem is avoided since interactions with the environment are prohibited: policies are learned from a given dataset, which solely determines their performance. Despite this, how dataset characteristics influence Offline RL algorithms has hardly been investigated. The dataset characteristics are determined by the behavioral policy that samples the dataset. We therefore characterize behavioral policies as exploratory, yielding high expected information in their interaction with the Markov Decision Process (MDP), and as exploitative, having high expected return. We implement two corresponding empirical measures for datasets sampled by behavioral policies in deterministic MDPs. The first empirical measure, SACo, is defined by the normalized number of unique state-action pairs and captures exploration. The second empirical measure, TQ, is defined by the normalized average trajectory return and captures exploitation. Empirical evaluations show the effectiveness of TQ and SACo. In large-scale experiments using our proposed measures, we show that the unconstrained off-policy Deep Q-Network family requires datasets with high SACo to find a good policy. Furthermore, the experiments show that policy-constraint algorithms perform well on datasets with high TQ and SACo, and that purely dataset-constrained Behavioral Cloning performs competitively with the best Offline RL algorithms on datasets with high TQ.
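To make the two measures concrete, below is a minimal Python sketch (not the authors' reference implementation) of how TQ and SACo could be computed from a dataset of trajectories in a deterministic MDP with discrete states and actions. The normalization constants used here, an assumed random-policy return, an assumed expert or online-policy return, and an assumed reference state-action count, are illustrative assumptions rather than the paper's exact definitions.

# Minimal sketch of the two dataset measures described in the abstract.
# The normalization references (random_return, expert_return, reference_count)
# are assumptions about how the "normalized" quantities could be obtained.
from typing import Hashable, List, Sequence, Tuple

# A trajectory is a sequence of (state, action, reward) transitions.
Transition = Tuple[Hashable, Hashable, float]
Trajectory = Sequence[Transition]


def trajectory_quality(dataset: List[Trajectory],
                       random_return: float,
                       expert_return: float) -> float:
    """TQ: average trajectory return, normalized between an assumed
    random-policy return and an assumed expert/online-policy return."""
    returns = [sum(r for _, _, r in traj) for traj in dataset]
    mean_return = sum(returns) / len(returns)
    return (mean_return - random_return) / (expert_return - random_return)


def state_action_coverage(dataset: List[Trajectory],
                          reference_count: int) -> float:
    """SACo: number of unique state-action pairs in the dataset, normalized
    by an assumed reference count (e.g. unique pairs of a reference dataset)."""
    unique_pairs = {(s, a) for traj in dataset for s, a, _ in traj}
    return len(unique_pairs) / reference_count


if __name__ == "__main__":
    # Toy dataset with two trajectories.
    data = [
        [("s0", "a0", 0.0), ("s1", "a1", 1.0)],
        [("s0", "a1", 0.0), ("s2", "a0", 0.5)],
    ]
    print("TQ:", trajectory_quality(data, random_return=0.0, expert_return=2.0))
    print("SACo:", state_action_coverage(data, reference_count=8))

In this sketch, TQ close to 1 indicates near-expert average trajectory return, while SACo close to 1 indicates state-action coverage comparable to the reference.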
Pages: 48