Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems

被引:0
|
作者
Even-Dar, Eyal
Mannor, Shie
Mansour, Yishay
机构
[1] Univ Penn, Dept Comp & Informat Sci, Philadelphia, PA 19104 USA
[2] McGill Univ, Dept Elect & Comp Engn, Montreal, PQ H3A 2A7, Canada
[3] Tel Aviv Univ, Sch Comp Sci, IL-69978 Tel Aviv, Israel
关键词
D O I
暂无
中图分类号
TP [自动化技术、计算机技术];
学科分类号
0812 ;
摘要
We incorporate statistical confidence intervals in both the multi-armed bandit and the reinforcement learning problems. In the bandit problem we show that given n arms, it suffices to pull the arms a total of O((n/epsilon(2)) log(1/delta)) times to find an epsilon-optimal arm with probability of at least 1-delta. This bound matches the lower bound of Mannor and Tsitsiklis (2004) up to constants. We also devise action elimination procedures in reinforcement learning algorithms. We describe a framework that is based on learning the confidence interval around the value function or the Q-function and eliminating actions that are not optimal (with high probability). We provide a model-based and a model-free variants of the elimination method. We further derive stopping conditions guaranteeing that the learned policy is approximately optimal with high probability. Simulations demonstrate a considerable speedup and added robustness over epsilon-greedy Q-learning.
引用
收藏
页码:1079 / 1105
页数:27
相关论文
共 50 条
  • [31] The Effect of Communication on Noncooperative Multiplayer Multi-Armed Bandit Problems
    Evirgen, Noyan
    Kose, Alper
    2017 16TH IEEE INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND APPLICATIONS (ICMLA), 2017, : 331 - 336
  • [32] An asymptotically optimal strategy for constrained multi-armed bandit problems
    Hyeong Soo Chang
    Mathematical Methods of Operations Research, 2020, 91 : 545 - 557
  • [33] On the Optimality of Perturbations in Stochastic and Adversarial Multi-armed Bandit Problems
    Kim, Baekjin
    Tewari, Ambuj
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 32 (NIPS 2019), 2019, 32
  • [34] Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems
    Bubeck, Sebastien
    Cesa-Bianchi, Nicolo
    FOUNDATIONS AND TRENDS IN MACHINE LEARNING, 2012, 5 (01): : 1 - 122
  • [35] Dynamic Multi-Armed Bandit with Covariates
    Pavlidis, Nicos G.
    Tasoulis, Dimitris K.
    Adams, Niall M.
    Hand, David J.
    ECAI 2008, PROCEEDINGS, 2008, 178 : 777 - +
  • [36] Scaling Multi-Armed Bandit Algorithms
    Fouche, Edouard
    Komiyama, Junpei
    Boehm, Klemens
    KDD'19: PROCEEDINGS OF THE 25TH ACM SIGKDD INTERNATIONAL CONFERENCCE ON KNOWLEDGE DISCOVERY AND DATA MINING, 2019, : 1449 - 1459
  • [37] The budgeted multi-armed bandit problem
    Madani, O
    Lizotte, DJ
    Greiner, R
    LEARNING THEORY, PROCEEDINGS, 2004, 3120 : 643 - 645
  • [38] The Multi-Armed Bandit With Stochastic Plays
    Lesage-Landry, Antoine
    Taylor, Joshua A.
    IEEE TRANSACTIONS ON AUTOMATIC CONTROL, 2018, 63 (07) : 2280 - 2286
  • [39] Achieving Regular and Fair Learning in Combinatorial Multi-Armed Bandit
    Wu, Xiaoyi
    Li, Bin
    IEEE INFOCOM 2024-IEEE CONFERENCE ON COMPUTER COMMUNICATIONS, 2024, : 361 - 370
  • [40] Multi-armed Bandit with Additional Observations
    Yun, Donggyu
    Proutiere, Alexandre
    Ahn, Sumyeong
    Shin, Jinwoo
    Yi, Yung
    PROCEEDINGS OF THE ACM ON MEASUREMENT AND ANALYSIS OF COMPUTING SYSTEMS, 2018, 2 (01)