Minimax weight learning for absorbing MDPs

Times Cited: 0
Authors
Li, Fengying [1]
Li, Yuqiang [1]
Wu, Xianyi [1]
Affiliations
[1] East China Normal Univ, Sch Stat, KLATASDS MOE, Shanghai 200062, Peoples R China
Funding
National Key R&D Program of China
Keywords
Absorbing MDP; Off-policy; Minimax weight learning; Policy evaluation; Occupancy measure; MODELS;
DOI
10.1007/s00362-023-01491-4
Chinese Library Classification (CLC)
O21 [Probability Theory and Mathematical Statistics]; C8 [Statistics]
Subject Classification Codes
020208; 070103; 0714
Abstract
Reinforcement learning policy evaluation problems are often modeled as finite-horizon or discounted/average-reward infinite-horizon Markov Decision Processes (MDPs). In this paper, we study undiscounted off-policy evaluation for absorbing MDPs. Given a dataset consisting of i.i.d. episodes collected under a given truncation level, we propose an algorithm (referred to as MWLA in the text) that directly estimates the expected return via the importance ratio of the state-action occupancy measure. A Mean Square Error (MSE) bound for the MWLA method is provided, and the dependence of the statistical error on the data size and the truncation level is analyzed. The performance of the algorithm is illustrated by computational experiments in an episodic taxi environment.
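The abstract does not spell out the MWLA estimator, so the following is only a rough Python sketch of the general minimax-weight-learning idea it refers to: learn w(s, a) ≈ d_pi(s, a) / d_D(s, a), the importance ratio of state-action occupancy measures, by forcing an empirical occupancy-balance equation to hold against a class of discriminator functions, and then reweight the observed rewards. The toy absorbing chain environment, the tabular one-hot function classes, and the least-squares reduction of the minimax objective are all assumptions made for illustration; this is not the authors' MWLA algorithm or their taxi experiment.

```python
# Minimal sketch (assumptions throughout): off-policy evaluation of an
# undiscounted absorbing MDP by estimating the occupancy-measure importance
# ratio w(s, a) with a minimax-weight-learning style objective.
import numpy as np

rng = np.random.default_rng(0)

# Toy absorbing chain: states 0..3 are transient, state 4 is absorbing.
N_S, N_A = 5, 2
ABSORBING = N_S - 1

def step(s, a):
    """Action 1 tends to move right (towards absorption), action 0 left."""
    if a == 1:
        s_next = min(s + 1, ABSORBING) if rng.random() < 0.8 else max(s - 1, 0)
    else:
        s_next = max(s - 1, 0) if rng.random() < 0.8 else min(s + 1, ABSORBING)
    reward = 1.0 if s_next == ABSORBING else -0.05   # step cost, bonus on absorption
    return s_next, reward

def policy_probs(p_right):
    """Simple stochastic policy: probability p_right of choosing action 1."""
    p = np.zeros((N_S, N_A))
    p[:, 1], p[:, 0] = p_right, 1.0 - p_right
    return p

pi_b = policy_probs(0.5)   # behavior policy (generates the data)
pi_t = policy_probs(0.9)   # target policy (to be evaluated off-policy)

# Collect i.i.d. behavior episodes, truncated at a fixed level.
N_EPISODES, TRUNC = 2000, 50
transitions, starts = [], []
for _ in range(N_EPISODES):
    s = 0
    starts.append(s)
    for _ in range(TRUNC):
        if s == ABSORBING:
            break
        a = rng.choice(N_A, p=pi_b[s])
        s_next, r = step(s, a)
        transitions.append((s, a, r, s_next))
        s = s_next

# Tabular one-hot features for (s, a) pairs.
def phi(s, a):
    v = np.zeros(N_S * N_A)
    v[s * N_A + a] = 1.0
    return v

def phi_pi(s, pi):
    """Expected feature under pi(.|s); zero at the absorbing state."""
    if s == ABSORBING:
        return np.zeros(N_S * N_A)
    return sum(pi[s, a] * phi(s, a) for a in range(N_A))

# With w(s, a) = alpha @ phi(s, a) and discriminators f in the unit ball of the
# same linear class, the empirical occupancy-balance residual is A @ alpha + b,
# so the minimax problem collapses to a least-squares solve (tabular assumption).
dim = N_S * N_A
A, b = np.zeros((dim, dim)), np.zeros(dim)
for s, a, r, s_next in transitions:
    A += np.outer(phi_pi(s_next, pi_t) - phi(s, a), phi(s, a)) / N_EPISODES
for s0 in starts:
    b += phi_pi(s0, pi_t) / N_EPISODES
alpha, *_ = np.linalg.lstsq(A, -b, rcond=None)

# Plug-in estimate of the undiscounted expected return of the target policy.
ret_hat = sum(alpha @ phi(s, a) * r for s, a, r, _ in transitions) / N_EPISODES
print(f"MWL-style off-policy estimate of the target return: {ret_hat:.3f}")
```

With richer function classes the inner maximization no longer has a closed form and the saddle point is usually approached by alternating updates over w and the discriminator; the constant TRUNC here stands in for the truncation level whose effect on the statistical error the abstract refers to.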
Pages: 3545-3582
Number of pages: 38
Related Papers
50 records in total
  • [1] Nearly Minimax Optimal Reinforcement Learning for Discounted MDPs
    He, Jiafan
    Zhou, Dongruo
    Gu, Quanquan
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 34 (NEURIPS 2021), 2021, 34
  • [2] Episodic Reinforcement Learning in Finite MDPs: Minimax Lower Bounds Revisited
    Domingues, Omar Darwiche
    Menard, Pierre
    Kaufmann, Emilie
    Valko, Michal
    ALGORITHMIC LEARNING THEORY, VOL 132, 2021, 132
  • [3] Nearly Minimax Optimal Regret for Learning Infinite-horizon Average-reward MDPs with Linear Function Approximation
    Wu, Yue
    Zhou, Dongruo
    Gu, Quanquan
    INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND STATISTICS, VOL 151, 2022, 151
  • [4] ε-MDPs: Learning in varying environments
    Szita, I
    Takács, B
    Lorincz, A
    JOURNAL OF MACHINE LEARNING RESEARCH, 2003, 3 (01) : 145 - 173
  • [5] Learning to Branch with Tree MDPs
    Scavuzzo, Lara
    Chen, Feng Yang
    Chetelat, Didier
    Gasse, Maxime
    Lodi, Andrea
    Yorke-Smith, Neil
    Aardal, Karen
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35 (NEURIPS 2022), 2022,
  • [6] Reinforcement learning for MDPs with constraints
    Geibel, Peter
    MACHINE LEARNING: ECML 2006, PROCEEDINGS, 2006, 4212 : 646 - 653
  • [7] Efficient reinforcement learning in factored MDPs
    Kearns, M
    Koller, D
    IJCAI-99: PROCEEDINGS OF THE SIXTEENTH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOLS 1 & 2, 1999, : 740 - 747
  • [8] Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning
    Sutton, RS
    Precup, D
    Singh, S
    ARTIFICIAL INTELLIGENCE, 1999, 112 (1-2) : 181 - 211
  • [9] Multitask reinforcement learning on the distribution of MDPs
    Tanaka, F
    Yamamura, M
    2003 IEEE INTERNATIONAL SYMPOSIUM ON COMPUTATIONAL INTELLIGENCE IN ROBOTICS AND AUTOMATION, VOLS I-III, PROCEEDINGS, 2003, : 1108 - 1113
  • [10] Expedited Learning in MDPs with Side Information
    Ornik, Melkior
    Fu, Jie
    Lauffer, Niklas T.
    Perera, W. K.
    Alshiekh, Mohammed
    Ono, Masahiro
    Topcu, Ufuk
    2018 IEEE CONFERENCE ON DECISION AND CONTROL (CDC), 2018, : 1941 - 1948