Align-RUDDER: Learning From Few Demonstrations by Reward Redistribution

Cited by: 0
Authors
Patil, Vihang [1,2]
Hofmarcher, Markus [1,2]
Dinu, Marius-Constantin [1,2,3]
Dorfer, Matthias [4]
Blies, Patrick [4]
Brandstetter, Johannes [1,2,5]
Arjona-Medina, Jose [1,2,3]
Hochreiter, Sepp [1,2,6]
Affiliations
[1] Johannes Kepler Univ Linz, Inst Machine Learning, ELLIS Unit Linz, Linz, Austria
[2] Johannes Kepler Univ Linz, Inst Machine Learning, LIT AI Lab, Linz, Austria
[3] Dynatrace Res, Linz, Austria
[4] EnliteAI, Vienna, Austria
[5] Microsoft Res, Redmond, WA USA
[6] Inst Adv Res Artificial Intelligence, Vienna, Austria
Funding
EU Horizon 2020
Keywords
MULTIPLE SEQUENCE ALIGNMENT; NEURAL-NETWORKS; ALGORITHM; SEARCH;
DOI
Not available
CLC Number
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Reinforcement learning algorithms require many samples when solving complex hierarchical tasks with sparse and delayed rewards. For such complex tasks, the recently proposed RUDDER uses reward redistribution to leverage steps in the Q-function that are associated with accomplishing sub-tasks. However, often only a few episodes with high rewards are available as demonstrations, since current exploration strategies cannot discover them in reasonable time. In this work, we introduce Align-RUDDER, which utilizes a profile model for reward redistribution that is obtained from multiple sequence alignment of demonstrations. Consequently, Align-RUDDER employs reward redistribution effectively and thereby drastically improves learning from few demonstrations. Align-RUDDER outperforms competitors on complex artificial tasks with delayed rewards and few demonstrations. On the Minecraft ObtainDiamond task, Align-RUDDER is able to mine a diamond, though not frequently. Code is available at github.com/ml-jku/align-rudder.
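The abstract describes reward redistribution driven by a profile model obtained from aligning demonstrations. The following is a minimal, hypothetical Python sketch of that idea, not the authors' implementation: it assumes pre-aligned, equal-length demonstration event sequences, builds per-position event frequencies as a toy stand-in for the profile, and redistributes an episode's delayed return in proportion to how much each step increases the prefix's profile-match score. All function names, event names, and the scoring rule are illustrative assumptions.

# Hypothetical sketch (not the authors' code): reward redistribution from a
# demonstration "profile" in the spirit of Align-RUDDER. The profile here is
# simply per-position event frequencies over pre-aligned, equal-length
# demonstrations; the real method derives it from a multiple sequence alignment.
from collections import Counter

def build_profile(demonstrations):
    # Per-position relative event frequencies of the pre-aligned demonstrations.
    length = len(demonstrations[0])
    profile = []
    for t in range(length):
        counts = Counter(demo[t] for demo in demonstrations)
        total = sum(counts.values())
        profile.append({event: c / total for event, c in counts.items()})
    return profile

def prefix_score(events, profile):
    # How well an episode prefix matches the profile (higher is better).
    return sum(profile[t].get(e, 0.0) for t, e in enumerate(events) if t < len(profile))

def redistribute_return(episode_events, episode_return, profile):
    # Spread a delayed return over steps: each step receives reward
    # proportional to how much it increases the prefix's profile score.
    scores = [prefix_score(episode_events[:t + 1], profile)
              for t in range(len(episode_events))]
    increments = [scores[0]] + [b - a for a, b in zip(scores, scores[1:])]
    total = sum(increments) or 1.0
    return [episode_return * inc / total for inc in increments]

if __name__ == "__main__":
    demos = [["wood", "planks", "stick", "stone", "pickaxe"],
             ["wood", "planks", "stick", "stone", "pickaxe"],
             ["wood", "stick", "planks", "stone", "pickaxe"]]
    profile = build_profile(demos)
    episode = ["wood", "planks", "stick", "stone", "pickaxe"]
    print(redistribute_return(episode, episode_return=1.0, profile=profile))

In the paper, new episodes are aligned against the profile obtained from the demonstrations; this sketch only illustrates the redistribution step, in which a sparse final return is converted into per-step rewards.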
Pages: 42