PROJECTED STATE-ACTION BALANCING WEIGHTS FOR OFFLINE REINFORCEMENT LEARNING

Cited by: 1
Authors
Wang, Jiayi [1 ]
Qi, Zhengling [2 ]
Wong, Raymond K. W. [3 ]
Affiliations
[1] Univ Texas Dallas, Dept Math Sci, Richardson, TX 75083 USA
[2] George Washington Univ, Dept Decis Sci, Washington, DC 20052 USA
[3] Texas A&M Univ, Dept Stat, College Stn, TX 77843 USA
Source
ANNALS OF STATISTICS, 2023, Vol. 51, No. 4
Funding
National Science Foundation (USA)
Keywords
Infinite horizons; Markov decision process; Policy evaluation; Reinforcement learning; DYNAMIC TREATMENT REGIMES; RATES; CONVERGENCE; INFERENCE;
DOI
10.1214/23-AOS2302
Chinese Library Classification
O21 [Probability theory and mathematical statistics]; C8 [Statistics]
Subject classification codes
020208; 070103; 0714
Abstract
Off-policy evaluation is considered a fundamental and challenging problem in reinforcement learning (RL). This paper focuses on value estimation of a target policy based on pre-collected data generated from a possibly different policy, under the framework of infinite-horizon Markov decision processes. Motivated by the recently developed marginal importance sampling method in RL and the covariate balancing idea in causal inference, we propose a novel estimator with approximately projected state-action balancing weights for the policy value estimation. We obtain the convergence rate of these weights, and show that the proposed value estimator is asymptotically normal under technical conditions. In terms of asymptotics, our results scale with both the number of trajectories and the number of decision points at each trajectory. As such, consistency can still be achieved with a limited number of subjects when the number of decision points diverges. In addition, we develop a necessary and sufficient condition for establishing the well-posedness of the operator that relates to the nonparametric Q-function estimation in the off-policy setting, which characterizes the difficulty of Q-function estimation and may be of independent interest. Numerical experiments demonstrate the promising performance of our proposed estimator.
Pages: 1639 - 1665
Number of pages: 27
Related papers
50 items in total
  • [31] Reinforcement learning in dynamic environment - Abstraction of state-action space utilizing properties of the robot body and environment -
    Takeuchi, Yutaka
    Ito, Kazuyuki
    PROCEEDINGS OF THE SEVENTEENTH INTERNATIONAL SYMPOSIUM ON ARTIFICIAL LIFE AND ROBOTICS (AROB 17TH '12), 2012, : 938 - 942
  • [32] R-learning with multiple state-action value tables
    Japan Advanced Institute of Science and Technology, Japan
    Electrical Engineering in Japan (English translation of Denki Gakkai Ronbunshi), 2007, 159 (03): 34 - 47
  • [33] Learning Pseudometric-based Action Representations for Offline Reinforcement Learning
    Gu, Pengjie
    Zhao, Mengchen
    Chen, Chen
    Li, Dong
    Hao, Jianye
    An, Bo
    INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 162, 2022,
  • [34] Action-Quantized Offline Reinforcement Learning for Robotic Skill Learning
    Luo, Jianlan
    Dong, Perry
    Wu, Jeffrey
    Kumar, Aviral
    Geng, Xinyang
    Levine, Sergey
    CONFERENCE ON ROBOT LEARNING, VOL 229, 2023
  • [35] Exploiting Action Impact Regularity and Exogenous State Variables for Offline Reinforcement Learning (Abstract Reprint)
    Liu, Vincent
    Wright, James R.
    White, Martha
    THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 20, 2024, : 22706 - 22706
  • [36] Conservative State Value Estimation for Offline Reinforcement Learning
    Chen, Liting
    Yan, Jie
    Shao, Zhengdao
    Wang, Lu
    Lin, Qingwei
    Rajmohan, Saravan
    Moscibroda, Thomas
    Zhang, Dongmei
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023,
  • [37] How Should Learning Classifier Systems Cover A State-Action Space?
    Nakata, Masaya
    Lanzi, Pier Luca
    Kovacs, Tim
    Browne, Will Neil
    Takadama, Keiki
    2015 IEEE CONGRESS ON EVOLUTIONARY COMPUTATION (CEC), 2015, : 3012 - 3019
  • [38] Reinforcement learning in multi-dimensional state-action space using random rectangular coarse coding and Gibbs sampling
    Kimura, Hajime
    2007 IEEE/RSJ INTERNATIONAL CONFERENCE ON INTELLIGENT ROBOTS AND SYSTEMS, VOLS 1-9, 2007, : 88 - 95
  • [39] Undesired state-action prediction in multi-agent reinforcement learning for linked multi-component robotic system control
    Fernandez-Gauna, Borja
    Marques, Ion
    Grana, Manuel
    INFORMATION SCIENCES, 2013, 232 : 309 - 324
  • [40] State Action Separable Reinforcement Learning
    Zhang, Ziyao
    Ma, Liang
    Leung, Kin K.
    Poularakis, Konstantinos
    Srivatsa, Mudhakar
    2020 IEEE INTERNATIONAL CONFERENCE ON BIG DATA (BIG DATA), 2020, : 123 - 132