First return, then explore

Cited by: 143
Authors
Ecoffet, Adrien [1,2]
Huizinga, Joost [1,2]
Lehman, Joel [1,2]
Stanley, Kenneth O. [1,2]
Clune, Jeff [1,2]
Affiliations
[1] Uber AI Labs, San Francisco, CA 94107 USA
[2] OpenAI, San Francisco, CA 94110 USA
Keywords
ARCADE LEARNING-ENVIRONMENT; LEVEL;
DOI
10.1038/s41586-020-03157-9
Chinese Library Classification (CLC)
O [Mathematical Sciences and Chemistry]; P [Astronomy and Earth Sciences]; Q [Biological Sciences]; N [General Natural Sciences]
Subject classification codes
07; 0710; 09
Abstract
A reinforcement learning algorithm that explicitly remembers promising states and returns to them as a basis for further exploration solves all as-yet-unsolved Atari games and outperforms previous algorithms on Montezuma's Revenge and Pitfall. Reinforcement learning promises to solve complex sequential-decision problems autonomously by specifying a high-level reward function only. However, reinforcement learning algorithms struggle when, as is often the case, simple and intuitive rewards provide sparse(1) and deceptive(2) feedback. Avoiding these pitfalls requires a thorough exploration of the environment, but creating algorithms that can do so remains one of the central challenges of the field. Here we hypothesize that the main impediment to effective exploration originates from algorithms forgetting how to reach previously visited states (detachment) and failing to first return to a state before exploring from it (derailment). We introduce Go-Explore, a family of algorithms that addresses these two challenges directly through the simple principles of explicitly 'remembering' promising states and returning to such states before intentionally exploring. Go-Explore solves all previously unsolved Atari games and surpasses the state of the art on all hard-exploration games(1), with orders-of-magnitude improvements on the grand challenges of Montezuma's Revenge and Pitfall. We also demonstrate the practical potential of Go-Explore on a sparse-reward pick-and-place robotics task. Additionally, we show that adding a goal-conditioned policy can further improve Go-Explore's exploration efficiency and enable it to handle stochasticity throughout training. The substantial performance gains from Go-Explore suggest that the simple principles of remembering states, returning to them, and exploring from them are a powerful and general approach to exploration, an insight that may prove critical to the creation of truly intelligent learning agents.
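
To make the algorithmic idea in the abstract concrete, the following is a minimal, hypothetical Python sketch of the "remember, return, explore" loop. The toy grid-world environment, the cell representation, and the least-visited selection rule are illustrative assumptions rather than the authors' Atari implementation; returning is done here by replaying a stored action sequence, which corresponds to the deterministic "restore" setting.

```python
import random
from dataclasses import dataclass, field


class ToyGridEnv:
    """Deterministic 1-D corridor; reaching position GOAL yields a reward of 1."""

    GOAL = 20

    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):
        # action is -1 (left) or +1 (right); positions are clamped at 0
        self.pos = max(0, self.pos + action)
        reward = 1.0 if self.pos == self.GOAL else 0.0
        done = self.pos == self.GOAL
        return self.pos, reward, done


@dataclass
class CellRecord:
    actions: list = field(default_factory=list)  # action sequence that reaches this cell
    score: float = 0.0                           # best return seen when reaching the cell
    visits: int = 0                              # how often the cell was selected


def cell_of(state):
    """Cell representation; here the raw position (the paper downscales Atari frames)."""
    return state


def go_explore(iterations=200, explore_steps=10, seed=0):
    rng = random.Random(seed)
    env = ToyGridEnv()
    archive = {cell_of(env.reset()): CellRecord()}

    for _ in range(iterations):
        # Remember: pick a promising (here: least-visited) cell from the archive.
        cell, record = min(archive.items(), key=lambda kv: kv[1].visits)
        record.visits += 1

        # First return: replay the stored trajectory to get back to that cell.
        env.reset()
        for action in record.actions:
            env.step(action)
        score = record.score  # deterministic replay reproduces the stored return

        # Then explore: take random actions and archive any new or better cells.
        trajectory = list(record.actions)
        for _ in range(explore_steps):
            action = rng.choice([-1, +1])
            state, reward, done = env.step(action)
            trajectory.append(action)
            score += reward
            c = cell_of(state)
            if c not in archive or score > archive[c].score:
                archive[c] = CellRecord(actions=list(trajectory), score=score)
            if done:
                break

    return max(record.score for record in archive.values())


if __name__ == "__main__":
    print("best return found:", go_explore())
```

In the goal-conditioned variant mentioned in the abstract, the return step is performed by a learned policy instead of exact replay, which is what allows Go-Explore to handle stochastic environments throughout training.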
Pages: 580-586
Number of pages: 22
Related papers
50 items in total
  • [1] Ecoffet, Adrien; Huizinga, Joost; Lehman, Joel; Stanley, Kenneth O.; Clune, Jeff. First return, then explore. Nature, 2021, 590: 580-586
  • [2] Szepesvari, Csaba. How to Explore to Maximize Future Return. Advances in Artificial Intelligence (AI 2015), 2015, 9091
  • [3] Hu, Tianran; Xia, Yinglong; Luo, Jiebo. To Return or to Explore: Modelling Human Mobility and Dynamics in Cyberspace. Web Conference 2019: Proceedings of the World Wide Web Conference (WWW 2019), 2019: 705-716
  • [4] Viani, Lisa Owens. First Foods Return. Landscape Architecture Magazine, 2020, 110 (10): 42-44
  • [5] Mir, Romaana; Rembielak, Agata. Return to First Principles. International Journal of Radiation Oncology Biology Physics, 2022, 113 (05): 914
  • [6] Darji, UB; Evans, MJ; Humke, PD. First return approachability. Journal of Mathematical Analysis and Applications, 1996, 199 (02): 545-557
  • [7] Solomon, B. Return to the first image. Journal of Peace Research, 1997, 34 (03): 249-255
  • [8] Hampton, H. Kids return (Recent films that explore the strange minds of children). Film Comment, 1999, 35 (06): 16+
  • [9] Powell, JR; Paniagua, JC; Maise, G. NEMO: A mission to explore and return samples from Europa's oceans. Space Technology and Applications International Forum-STAIF 2004, 2004, 699: 223-229
  • [10] Nolan, Jessica M. Using Jackson's Return Potential Model to Explore the Normativeness of Recycling. Environment and Behavior, 2015, 47 (08): 835-855