PROJECTED STATE-ACTION BALANCING WEIGHTS FOR OFFLINE REINFORCEMENT LEARNING

Cited: 0
Authors
Wang, Jiayi [1 ]
Qi, Zhengling [2 ]
Wong, Raymond K. W. [3 ]
Affiliations
[1] Univ Texas Dallas, Dept Math Sci, Richardson, TX 75083 USA
[2] George Washington Univ, Dept Decis Sci, Washington, DC 20052 USA
[3] Texas A&M Univ, Dept Stat, College Station, TX 77843 USA
Source
ANNALS OF STATISTICS | 2023, Vol. 51, No. 4
Funding
National Science Foundation (USA);
Keywords
Infinite horizons; Markov decision process; Policy evaluation; Reinforcement learning; DYNAMIC TREATMENT REGIMES; RATES; CONVERGENCE; INFERENCE;
DOI
10.1214/23-AOS2302
CLC Classification
O21 [Probability Theory and Mathematical Statistics]; C8 [Statistics];
Subject Classification Codes
020208; 070103; 0714;
Abstract
Off-policy evaluation is a fundamental and challenging problem in reinforcement learning (RL). This paper focuses on value estimation of a target policy based on pre-collected data generated from a possibly different policy, under the framework of infinite-horizon Markov decision processes. Motivated by the recently developed marginal importance sampling method in RL and the covariate-balancing idea in causal inference, we propose a novel estimator with approximately projected state-action balancing weights for policy value estimation. We obtain the convergence rate of these weights and show that the proposed value estimator is asymptotically normal under technical conditions. In terms of asymptotics, our results scale with both the number of trajectories and the number of decision points per trajectory. As such, consistency can still be achieved with a limited number of subjects when the number of decision points diverges. In addition, we develop a necessary and sufficient condition for the well-posedness of the operator that underlies nonparametric Q-function estimation in the off-policy setting; this condition characterizes the difficulty of Q-function estimation and may be of independent interest. Numerical experiments demonstrate the promising performance of the proposed estimator.
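Once the state-action balancing weights have been estimated, the value estimate described in the abstract reduces to a weighted average of the observed rewards over all transitions. A minimal sketch of that final step (the function name, toy data, and self-normalization choice are illustrative assumptions, not the paper's exact estimator):

```python
import numpy as np

def weighted_value_estimate(rewards, weights):
    """Self-normalized weighted average of observed rewards: each
    transition's reward is reweighted by an estimated state-action
    density ratio, the generic form of a marginal-importance-sampling
    off-policy value estimate."""
    rewards = np.asarray(rewards, dtype=float)
    weights = np.asarray(weights, dtype=float)
    # Normalize so the weights average to one over the dataset; this
    # keeps the estimate on the scale of the rewards.
    return float(np.sum(weights * rewards) / np.sum(weights))

# Toy data: rewards from observed transitions (trajectories flattened),
# with made-up weights that up- or down-weight individual transitions.
rewards = [1.0, 0.0, 1.0, 1.0]
weights = [0.5, 1.5, 1.0, 1.0]
print(weighted_value_estimate(rewards, weights))  # -> 0.625
```

In the paper's setting the weights themselves come from the projected balancing procedure; here they are simply given, to show where they enter the estimate.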
Pages: 1639-1665
Page count: 27
Related Papers
50 items in total
  • [1] For SALE: State-Action Representation Learning for Deep Reinforcement Learning
    Fujimoto, Scott
    Chang, Wei-Di
    Smith, Edward J.
    Gu, Shixiang Shane
    Precup, Doina
    Meger, David
    [J]. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023,
  • [2] Enhancing visual reinforcement learning with State-Action Representation
    Yan, Mengbei
    Lyu, Jiafei
    Li, Xiu
    [J]. KNOWLEDGE-BASED SYSTEMS, 2024, 304
  • [3] A REINFORCEMENT LEARNING MODEL USING DETERMINISTIC STATE-ACTION SEQUENCES
    Murata, Makoto
    Ozawa, Seiichi
    [J]. INTERNATIONAL JOURNAL OF INNOVATIVE COMPUTING INFORMATION AND CONTROL, 2010, 6 (02): : 577 - 590
  • [4] Efficient Reinforcement Learning Using State-Action Uncertainty with Multiple Heads
    Aizu, Tomoharu
    Oba, Takeru
    Ukita, Norimichi
    [J]. ARTIFICIAL NEURAL NETWORKS AND MACHINE LEARNING, ICANN 2023, PT VIII, 2023, 14261 : 184 - 196
  • [5] Speeding up Tabular Reinforcement Learning Using State-Action Similarities
    Rosenfeld, Ariel
    Taylor, Matthew E.
    Kraus, Sarit
    [J]. AAMAS'17: PROCEEDINGS OF THE 16TH INTERNATIONAL CONFERENCE ON AUTONOMOUS AGENTS AND MULTIAGENT SYSTEMS, 2017, : 1722 - 1724
  • [6] Swarm Reinforcement Learning Methods for Problems with Continuous State-Action Space
    Iima, Hitoshi
    Kuroe, Yasuaki
    Emoto, Kazuo
    [J]. 2011 IEEE INTERNATIONAL CONFERENCE ON SYSTEMS, MAN, AND CYBERNETICS (SMC), 2011, : 2173 - 2180
  • [7] Model-Based Reinforcement Learning Exploiting State-Action Equivalence
    Asadi, Mahsa
    Talebi, Mohammad Sadegh
    Bourel, Hippolyte
    Maillard, Odalric-Ambrym
    [J]. ASIAN CONFERENCE ON MACHINE LEARNING, VOL 101, 2019, 101 : 204 - 219
  • [8] Jointly-Learned State-Action Embedding for Efficient Reinforcement Learning
    Pritz, Paul J.
    Ma, Liang
    Leung, Kin K.
    [J]. PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON INFORMATION & KNOWLEDGE MANAGEMENT, CIKM 2021, 2021, : 1447 - 1456
  • [9] A Plume-Tracing Strategy via Continuous State-action Reinforcement Learning
    Niu, Lvyin
    Song, Shiji
    You, Keyou
    [J]. 2017 CHINESE AUTOMATION CONGRESS (CAC), 2017, : 759 - 764
  • [10] Near-continuous time Reinforcement Learning for continuous state-action spaces
    Croissant, Lorenzo
    Abeille, Marc
    Bouchard, Bruno
    [J]. INTERNATIONAL CONFERENCE ON ALGORITHMIC LEARNING THEORY, VOL 237, 2024, 237