Framing reinforcement learning from human reward: Reward positivity, temporal discounting, episodicity, and performance

Cited by: 33
Authors:
Knox, W. Bradley [1 ]
Stone, Peter [2 ]
Affiliations:
[1] MIT, Media Lab, Cambridge, MA 02139 USA
[2] Univ Texas Austin, Dept Comp Sci, Austin, TX 78712 USA
Funding:
National Science Foundation (USA);
Keywords:
Reinforcement learning; Modeling user behavior; End-user programming; Human-agent interaction; Interactive machine learning; Human teachers; ROBOT;
DOI:
10.1016/j.artint.2015.03.009
Chinese Library Classification (CLC):
TP18 [Artificial Intelligence Theory];
Discipline Codes:
081104; 0812; 0835; 1405;
Abstract:
Several studies have demonstrated that reward from a human trainer can be a powerful feedback signal for control-learning algorithms. However, the space of algorithms for learning from such human reward has not yet been explored systematically. Using model-based reinforcement learning, this article investigates the problem of learning from human reward through six experiments, focusing on the relationships between reward positivity, how generally positive a trainer's reward values are; temporal discounting, the extent to which future reward is discounted in value; episodicity, whether task learning occurs in discrete learning episodes rather than in one continuing session; and task performance, the agent's performance on the task the trainer intends to teach. This investigation is motivated by the observation that an agent can pursue different learning objectives, leading to different resulting behaviors; we search for learning objectives that lead the agent to behave as the trainer intends. We identify and empirically support a "positive circuits" problem with low discounting (i.e., high discount factors) for episodic, goal-based tasks: because human trainers exhibit an observed bias towards giving positive reward, a non-myopic agent can accrue more value by repeatedly traversing positively rewarded circuits of states and actions than by reaching the goal, which leads us to endorse myopic learning for such domains. We then show that converting simple episodic tasks to be non-episodic (i.e., continuing) reduces, and in some cases resolves, the issues present in episodic tasks with generally positive reward and, relatedly, enables highly successful learning with non-myopic valuation in multiple user studies. The primary learning algorithm introduced in this article, which we call "VI-TAMER", is the first algorithm to successfully learn non-myopically from reward generated by a human trainer; we also show empirically that such non-myopic valuation facilitates a higher-level understanding of the task. Anticipating the complexity of real-world problems, we perform further studies (one with a failure state added) that compare (1) learning when states are updated asynchronously with local bias, i.e., states quickly reachable from the agent's current state are updated more often than other states, with (2) learning with the fully synchronous sweeps across each state used by the VI-TAMER algorithm. With these locally biased updates, we find that the general positivity of human reward creates problems even for continuing tasks, revealing a distinct research challenge for future work. (C) 2015 Elsevier B.V. All rights reserved.
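Because the abstract compresses the "positive circuits" argument, a small worked example may help. The following is a minimal, hypothetical sketch, not the paper's code or tasks: a four-state episodic chain with an assumed all-positive human-reward model, solved by tabular value iteration. With a myopic objective (discount factor gamma = 0) the greedy agent reaches the goal; with gamma = 0.99 it instead circles through positively rewarded transitions and never ends the episode. All names, reward values, and the transition structure here are illustrative assumptions.

```python
# Hypothetical sketch of the "positive circuits" problem; not the paper's code.
# Toy episodic chain: states 0..3, where state 3 is an absorbing goal.
# Actions: 0 = "advance" toward the goal, 1 = "loop" back to state 0.
import numpy as np

N_STATES, GOAL = 4, 3

def next_state(s, a):
    if s == GOAL:                # absorbing: the episode is effectively over
        return GOAL
    return s + 1 if a == 0 else 0

def human_reward(s, a):
    # Assumed reward model with the positivity bias the paper observes:
    # every transition earns positive reward; the goal transition earns more.
    if s == GOAL:
        return 0.0               # no further reward once the episode ends
    return 2.0 if next_state(s, a) == GOAL else 0.5

def value_iteration(gamma, sweeps=500):
    V = np.zeros(N_STATES)
    for _ in range(sweeps):      # synchronous sweeps across every state
        for s in range(N_STATES):
            V[s] = max(human_reward(s, a) + gamma * V[next_state(s, a)]
                       for a in (0, 1))
    return V

def greedy_action(s, V, gamma):
    return max((0, 1),
               key=lambda a: human_reward(s, a) + gamma * V[next_state(s, a)])

def reaches_goal(V, gamma, steps=20):
    s = 0
    for _ in range(steps):       # roll out the greedy policy from state 0
        s = next_state(s, greedy_action(s, V, gamma))
        if s == GOAL:
            return True
    return False

for gamma in (0.0, 0.99):
    V = value_iteration(gamma)
    print(f"gamma={gamma}: goal reached? {reaches_goal(V, gamma)}")
# Expected: gamma=0.0 reaches the goal; gamma=0.99 prefers the positive
# circuit (looping is worth ~0.5/(1-gamma) vs. a one-time 2.0 at the goal).
```

The same toy loosely mirrors why the episodic-to-continuing conversion helps: if the goal transitioned back to state 0 instead of absorbing, the goal route would earn about 1.0 reward per step versus 0.5 for the loop, so non-myopic valuation would again favor the behavior the trainer intends.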
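The abstract's final comparison, fully synchronous sweeps versus asynchronous updates with local bias, can be sketched in the same way. The fragment below continues the toy model above (it reuses N_STATES, next_state, and human_reward); the scheduling rule, the horizon parameter, and all function names are assumptions for illustration, not the paper's implementation.

```python
import random

def bellman_backup(V, s, gamma):
    # One backup of a single state under the toy model defined above.
    return max(human_reward(s, a) + gamma * V[next_state(s, a)]
               for a in (0, 1))

def synchronous_sweep(V, gamma):
    # Fully synchronous scheme: back up every state once per sweep,
    # as in the value_iteration loop of the previous sketch.
    for s in range(N_STATES):
        V[s] = bellman_backup(V, s, gamma)

def locally_biased_updates(V, current_state, gamma, n_updates=4, horizon=2):
    # Asynchronous scheme with local bias: back up only states reachable
    # from the agent's current state within `horizon` steps, so nearby
    # states are updated far more often than distant ones.
    reachable = {current_state}
    for _ in range(horizon):
        reachable |= {next_state(s, a) for s in reachable for a in (0, 1)}
    for _ in range(n_updates):
        s = random.choice(sorted(reachable))
        V[s] = bellman_backup(V, s, gamma)
```

One plausible reading of the abstract's finding: under the locally biased schedule, value estimates around a positively rewarded loop near the agent can inflate before distant states are ever backed up, so reward positivity causes trouble even in continuing tasks.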
Pages: 24-50
Page count: 27