Minimax weight learning for absorbing MDPs

Times Cited: 0
Authors
Li, Fengying [1 ]
Li, Yuqiang [1 ]
Wu, Xianyi [1 ]
Affiliation
[1] East China Normal Univ, Sch Stat, KLATASDS MOE, Shanghai 200062, Peoples R China
Funding
National Key Research and Development Program of China;
Keywords
Absorbing MDP; Off-policy; Minimax weight learning; Policy evaluation; Occupancy measure; MODELS;
DOI
10.1007/s00362-023-01491-4
CLC Number
O21 [Probability Theory and Mathematical Statistics]; C8 [Statistics];
Subject Classification Codes
020208; 070103; 0714;
Abstract
Reinforcement learning policy evaluation problems are often modeled as finite-horizon or discounted/average-reward infinite-horizon Markov Decision Processes (MDPs). In this paper, we study undiscounted off-policy evaluation for absorbing MDPs. Given a dataset consisting of i.i.d. episodes collected under a given truncation level, we propose an algorithm (referred to as MWLA in the text) that directly estimates the expected return via the importance ratio of the state-action occupancy measure. A Mean Square Error (MSE) bound for the MWLA method is provided, and the dependence of the statistical error on the data size and the truncation level is analyzed. The performance of the algorithm is illustrated by computational experiments in an episodic taxi environment.
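To make the idea in the abstract concrete, here is a minimal tabular sketch of minimax weight learning for an undiscounted absorbing MDP. With one-hot (tabular) features for both the weight function w and the test function f, the occupancy-balance equation becomes a linear system, the inner maximization over f has a closed form, and w is recovered by regularized least squares. The function name mwla_tabular, the episode format, and the ridge regularizer are illustrative assumptions, not the paper's exact MWLA estimator.

```python
import numpy as np

def mwla_tabular(episodes, pi, n_states, n_actions, ridge=1e-6):
    """Sketch of minimax weight learning on an absorbing MDP (assumptions:
    episodes is a list of trajectories [(s, a, r, s_next, done), ...] drawn
    i.i.d. under the behavior policy; pi is the target policy, shape (S, A)).
    Returns the estimated occupancy ratio w and the plug-in value estimate."""
    d = n_states * n_actions
    idx = lambda s, a: s * n_actions + a
    n = len(episodes)
    # For the true ratio w, the undiscounted balance equation reads
    # A w + b = 0, where column idx(s,a) of A accumulates
    # phi_bar(s') - phi(s,a) over observed transitions (phi_bar averages
    # one-hot features over pi, and f vanishes at the absorbing state),
    # and b is the per-episode initial-state term E[phi_bar(s0)].
    A = np.zeros((d, d))
    b = np.zeros(d)
    for ep in episodes:
        s0 = ep[0][0]
        for a0 in range(n_actions):
            b[idx(s0, a0)] += pi[s0, a0] / n
        for (s, a, r, s_next, done) in ep:
            j = idx(s, a)
            A[j, j] -= 1.0 / n
            if not done:
                for a_next in range(n_actions):
                    A[idx(s_next, a_next), j] += pi[s_next, a_next] / n
    # Closed-form minimax solution over the unit ball of test functions:
    # minimize ||A w + b||^2 + ridge * ||w||^2 by regularized least squares.
    w = np.linalg.solve(A.T @ A + ridge * np.eye(d), -A.T @ b)
    # Expected return under pi: average per-episode weighted reward sum.
    value = sum(w[idx(s, a)] * r
                for ep in episodes
                for (s, a, r, s_next, done) in ep) / n
    return w, value
```

In this sketch the truncation level enters through the episode length: trajectories cut off before absorption bias the balance equation, which is the data-size/truncation trade-off the MSE bound in the paper quantifies.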
Pages: 3545-3582
Number of Pages: 38
Related Papers (50 in total; entries [21]-[30] shown)
  • [21] Cooperative Online Learning in Stochastic and Adversarial MDPs
    Lancewicki, Tal
    Rosenberg, Aviv
    Mansour, Yishay
    INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 162, 2022
  • [22] Reduction Techniques for Model Checking and Learning in MDPs
    Bharadwaj, Suda
    Le Roux, Stephane
    Perez, Guillermo A.
    Topcu, Ufuk
    PROCEEDINGS OF THE TWENTY-SIXTH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2017: 4273-4279
  • [23] Learning option MDPs from small data
    Zehfroosh, Ashkan
    Tanner, Herbert G.
    Heinz, Jeffrey
    2018 ANNUAL AMERICAN CONTROL CONFERENCE (ACC), 2018: 252-257
  • [24] Minimax Model Learning
    Voloshin, Cameron
    Jiang, Nan
    Yue, Yisong
    24TH INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND STATISTICS (AISTATS), 2021, 130
  • [25] Reinforcement Learning in Finite MDPs: PAC Analysis
    Strehl, Alexander L.
    Li, Lihong
    Littman, Michael L.
    JOURNAL OF MACHINE LEARNING RESEARCH, 2009, 10: 2413-2444
  • [26] Reinforcement Learning in Reward-Mixing MDPs
    Kwon, Jeongyeol
    Efroni, Yonathan
    Caramanis, Constantine
    Mannor, Shie
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 34 (NEURIPS 2021), 2021, 34
  • [27] TeXDYNA: Hierarchical Reinforcement Learning in Factored MDPs
    Kozlova, Olga
    Sigaud, Olivier
    Meyer, Christophe
    FROM ANIMALS TO ANIMATS 11, 2010, 6226: 489+
  • [28] Safety-Constrained Reinforcement Learning for MDPs
    Junges, Sebastian
    Jansen, Nils
    Dehnert, Christian
    Topcu, Ufuk
    Katoen, Joost-Pieter
    TOOLS AND ALGORITHMS FOR THE CONSTRUCTION AND ANALYSIS OF SYSTEMS (TACAS 2016), 2016, 9636: 130-146
  • [29] Learning to Act in Decentralized Partially Observable MDPs
    Dibangoye, Jilles S.
    Buffet, Olivier
    INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 80, 2018, 80
  • [30] Belief Propagation for MiniMax Weight Matching
    Yuan, Mindi
    Li, Shen
    Shen, Wei
    Pavlidis, Yannis
    MODELLING, COMPUTATION AND OPTIMIZATION IN INFORMATION SYSTEMS AND MANAGEMENT SCIENCES - MCO 2015, PT 1, 2015, 359: 37-45