First return, then explore

Cited by: 143
Authors
Ecoffet, Adrien [1,2]
Huizinga, Joost [1,2]
Lehman, Joel [1,2]
Stanley, Kenneth O. [1,2]
Clune, Jeff [1,2]
Affiliations
[1] Uber AI Labs, San Francisco, CA 94107 USA
[2] OpenAI, San Francisco, CA 94110 USA
Keywords
ARCADE LEARNING-ENVIRONMENT; LEVEL;
DOI
10.1038/s41586-020-03157-9
Chinese Library Classification (CLC)
O [Mathematical Sciences and Chemistry]; P [Astronomy and Earth Sciences]; Q [Biological Sciences]; N [General Natural Sciences]
Subject classification codes
07; 0710; 09
Abstract
A reinforcement learning algorithm that explicitly remembers promising states and returns to them as a basis for further exploration solves all as-yet-unsolved Atari games and outperforms previous algorithms on Montezuma's Revenge and Pitfall. Reinforcement learning promises to solve complex sequential-decision problems autonomously by specifying a high-level reward function only. However, reinforcement learning algorithms struggle when, as is often the case, simple and intuitive rewards provide sparse(1) and deceptive(2) feedback. Avoiding these pitfalls requires a thorough exploration of the environment, but creating algorithms that can do so remains one of the central challenges of the field. Here we hypothesize that the main impediment to effective exploration originates from algorithms forgetting how to reach previously visited states (detachment) and failing to first return to a state before exploring from it (derailment). We introduce Go-Explore, a family of algorithms that addresses these two challenges directly through the simple principles of explicitly 'remembering' promising states and returning to such states before intentionally exploring. Go-Explore solves all previously unsolved Atari games and surpasses the state of the art on all hard-exploration games(1), with orders-of-magnitude improvements on the grand challenges of Montezuma's Revenge and Pitfall. We also demonstrate the practical potential of Go-Explore on a sparse-reward pick-and-place robotics task. Additionally, we show that adding a goal-conditioned policy can further improve Go-Explore's exploration efficiency and enable it to handle stochasticity throughout training. The substantial performance gains from Go-Explore suggest that the simple principles of remembering states, returning to them, and exploring from them are a powerful and general approach to exploration, an insight that may prove critical to the creation of truly intelligent learning agents.
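
To make the algorithmic idea in the abstract concrete, the following is a minimal, hypothetical Python sketch of the "remember, return, explore" loop. The toy grid-world environment, the cell representation, and the least-visited selection rule are illustrative assumptions rather than the authors' Atari implementation; returning is done here by replaying a stored action sequence, which corresponds to the deterministic "restore" setting.

```python
import random
from dataclasses import dataclass, field


class ToyGridEnv:
    """Deterministic 1-D corridor; reaching position GOAL yields a reward of 1."""

    GOAL = 20

    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):
        # action is -1 (left) or +1 (right); positions are clamped at 0
        self.pos = max(0, self.pos + action)
        reward = 1.0 if self.pos == self.GOAL else 0.0
        done = self.pos == self.GOAL
        return self.pos, reward, done


@dataclass
class CellRecord:
    actions: list = field(default_factory=list)  # action sequence that reaches this cell
    score: float = 0.0                           # best return seen when reaching the cell
    visits: int = 0                              # how often the cell was selected


def cell_of(state):
    """Cell representation; here the raw position (the paper downscales Atari frames)."""
    return state


def go_explore(iterations=200, explore_steps=10, seed=0):
    rng = random.Random(seed)
    env = ToyGridEnv()
    archive = {cell_of(env.reset()): CellRecord()}

    for _ in range(iterations):
        # Remember: pick a promising (here: least-visited) cell from the archive.
        cell, record = min(archive.items(), key=lambda kv: kv[1].visits)
        record.visits += 1

        # First return: replay the stored trajectory to get back to that cell.
        env.reset()
        for action in record.actions:
            env.step(action)
        score = record.score  # deterministic replay reproduces the stored return

        # Then explore: take random actions and archive any new or better cells.
        trajectory = list(record.actions)
        for _ in range(explore_steps):
            action = rng.choice([-1, +1])
            state, reward, done = env.step(action)
            trajectory.append(action)
            score += reward
            c = cell_of(state)
            if c not in archive or score > archive[c].score:
                archive[c] = CellRecord(actions=list(trajectory), score=score)
            if done:
                break

    return max(record.score for record in archive.values())


if __name__ == "__main__":
    print("best return found:", go_explore())
```

In the goal-conditioned variant mentioned in the abstract, the return step is performed by a learned policy instead of exact replay, which is what allows Go-Explore to handle stochastic environments throughout training.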
Pages: 580-586
Number of pages: 22
Related papers
50 items in total
  • [1] Ecoffet, Adrien; Huizinga, Joost; Lehman, Joel; Stanley, Kenneth O.; Clune, Jeff. First return, then explore. Nature, 2021, 590: 580-586
  • [2] Szepesvari, Csaba. How to Explore to Maximize Future Return. Advances in Artificial Intelligence (AI 2015), 2015, 9091
  • [3] Hu, Tianran; Xia, Yinglong; Luo, Jiebo. To Return or to Explore: Modelling Human Mobility and Dynamics in Cyberspace. Web Conference 2019: Proceedings of the World Wide Web Conference (WWW 2019), 2019: 705-716
  • [4] Viani, Lisa Owens. First Foods Return. Landscape Architecture Magazine, 2020, 110 (10): 42-44
  • [5] Mir, Romaana; Rembielak, Agata. Return to First Principles. International Journal of Radiation Oncology Biology Physics, 2022, 113 (05): 914
  • [6] Darji, UB; Evans, MJ; Humke, PD. First return approachability. Journal of Mathematical Analysis and Applications, 1996, 199 (02): 545-557
  • [7] Solomon, B. Return to the first image. Journal of Peace Research, 1997, 34 (03): 249-255
  • [8] Hampton, H. Kids return (Recent films that explore the strange minds of children). Film Comment, 1999, 35 (06): 16+
  • [9] Powell, JR; Paniagua, JC; Maise, G. NEMO: A mission to explore and return samples from Europa's oceans. Space Technology and Applications International Forum-STAIF 2004, 2004, 699: 223-229
  • [10] Nolan, Jessica M. Using Jackson's Return Potential Model to Explore the Normativeness of Recycling. Environment and Behavior, 2015, 47 (08): 835-855