Align-RUDDER: Learning From Few Demonstrations by Reward Redistribution

Cited by: 0
Authors
Patil, Vihang [1 ,2 ]
Hofmarcher, Markus [1 ,2 ]
Dinu, Marius-Constantin [1 ,2 ,3 ]
Dorfer, Matthias [4 ]
Blies, Patrick [4 ]
Brandstetter, Johannes [1 ,2 ,5 ]
Arjona-Medina, Jose [1 ,2 ,3 ]
Hochreiter, Sepp [1 ,2 ,6 ]
Affiliations
[1] Johannes Kepler Univ Linz, Inst Machine Learning, ELLIS Unit Linz, Linz, Austria
[2] Johannes Kepler Univ Linz, Inst Machine Learning, LIT AI Lab, Linz, Austria
[3] Dynatrace Res, Linz, Austria
[4] EnliteAI, Vienna, Austria
[5] Microsoft Res, Redmond, WA USA
[6] Inst Adv Res Artificial Intelligence, Vienna, Austria
Funding
EU Horizon 2020
Keywords
Multiple sequence alignment; Neural networks; Algorithm; Search
DOI
Not available
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Reinforcement learning algorithms require many samples when solving complex hierarchical tasks with sparse and delayed rewards. For such complex tasks, the recently proposed RUDDER uses reward redistribution to leverage steps in the Q-function that are associated with accomplishing sub-tasks. However, often only a few episodes with high rewards are available as demonstrations, since current exploration strategies cannot discover them in reasonable time. In this work, we introduce Align-RUDDER, which utilizes a profile model for reward redistribution that is obtained from multiple sequence alignment of demonstrations. Consequently, Align-RUDDER employs reward redistribution effectively and, thereby, drastically improves learning from few demonstrations. Align-RUDDER outperforms competitors on complex artificial tasks with delayed rewards and few demonstrations. On the Minecraft ObtainDiamond task, Align-RUDDER is able to mine a diamond, though not frequently. Code is available at github.com/ml-jku/align-rudder.
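The abstract's core mechanism can be sketched in a few lines. This is a hedged illustration, not the paper's implementation: Align-RUDDER clusters states into events, aligns the demonstration event sequences with a multiple-sequence-alignment algorithm, and scores episodes against the resulting profile model. Here the demonstrations are assumed to be pre-aligned sequences of equal length, the profile is a simple position-frequency table, and all function names are invented for illustration. The key idea survives the simplification: the delayed episodic return is redistributed over time steps in proportion to how much each step increases the episode's alignment score.

```python
from collections import Counter

def build_profile(aligned_demos):
    """Position-wise event frequencies from (already aligned) demonstration
    event sequences of equal length."""
    length = len(aligned_demos[0])
    profile = []
    for pos in range(length):
        counts = Counter(seq[pos] for seq in aligned_demos)
        total = sum(counts.values())
        profile.append({event: c / total for event, c in counts.items()})
    return profile

def alignment_score(events, profile):
    """Score an episode's event prefix against the profile: sum of the
    profile frequency of each event at its position (0 for unseen events)."""
    return sum(profile[i].get(e, 0.0)
               for i, e in enumerate(events[:len(profile)]))

def redistribute_reward(events, profile, episode_return):
    """Spread the delayed return over steps in proportion to each step's
    increase in alignment score (the RUDDER-style score difference)."""
    scores = [alignment_score(events[:t + 1], profile)
              for t in range(len(events))]
    deltas = [scores[0]] + [scores[t] - scores[t - 1]
                            for t in range(1, len(scores))]
    total = sum(deltas) or 1.0  # avoid division by zero for zero-score episodes
    return [episode_return * d / total for d in deltas]
```

For example, with two identical demonstrations `"ABCD"` and an episode `"ABCD"` that receives a delayed return of 1.0 only at the end, every step matches the profile equally, so the return is spread evenly at 0.25 per step; an episode that deviates from the demonstrated event order would receive less credit at the deviating steps.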
Pages: 42