JARVIS-1: Open-World Multi-Task Agents With Memory-Augmented Multimodal Language Models

Citations: 0
Authors
Wang, Zihao [1 ]
Cai, Shaofei [1 ]
Liu, Anji [2 ]
Jin, Yonggang [3 ]
Hou, Jinbing [3 ]
Zhang, Bowei [1 ]
Lin, Haowei [1 ]
He, Zhaofeng [3 ]
Zheng, Zilong [4 ]
Yang, Yaodong [1 ]
Ma, Xiaojian [4 ]
Liang, Yitao [1 ]
Affiliations
[1] Peking Univ, Inst Artificial Intelligence, Beijing 100871, Peoples R China
[2] Univ Calif Los Angeles, Los Angeles, CA 90095 USA
[3] Beijing Univ Posts & Telecommun, Sch Comp Sci, Beijing 100876, Peoples R China
[4] Beijing Inst Gen Artificial Intelligence BIGAI, Beijing 100876, Peoples R China
Funding
National Key R&D Program of China;
Keywords
Planning; Diamond; Games; Complexity theory; Cognition; Accuracy; Visualization; Reliability; Multitasking; Iron; Minecraft; multimodal language model; open-world agents;
DOI
10.1109/TPAMI.2024.3511593
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Achieving human-like planning and control with multimodal observations in an open world is a key milestone toward more functional generalist agents. Existing approaches can handle certain long-horizon tasks in an open world, yet they struggle when the number of open-world tasks is potentially infinite, and they lack the capability to progressively improve task completion as game time progresses. We introduce JARVIS-1, an open-world agent that can perceive multimodal input (visual observations and human instructions), generate sophisticated plans, and perform embodied control, all within the popular yet challenging open-world Minecraft universe. Specifically, we develop JARVIS-1 on top of pre-trained multimodal language models, which map visual observations and textual instructions to plans; the plans are ultimately dispatched to goal-conditioned controllers. We outfit JARVIS-1 with a multimodal memory, which facilitates planning using both pre-trained knowledge and its actual in-game survival experiences. JARVIS-1 is the most general agent in Minecraft to date, capable of completing over 200 different tasks using a control and observation space similar to that of humans. These tasks range from short-horizon tasks, e.g., "chopping trees," to long-horizon ones, e.g., "obtaining a diamond pickaxe." JARVIS-1 performs exceptionally well on short-horizon tasks, achieving nearly perfect performance. On the classic long-horizon task of ObtainDiamondPickaxe, JARVIS-1 surpasses the reliability of current state-of-the-art agents by 5 times and can successfully complete even longer-horizon and more challenging tasks. Furthermore, we show that JARVIS-1 is able to self-improve following a life-long learning paradigm thanks to its multimodal memory, sparking a more general intelligence and improved autonomy.
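The abstract describes a plan-then-control pipeline: a multimodal language model turns observations and instructions into a plan, a goal-conditioned controller executes each sub-goal, and successful experiences are written back to a multimodal memory so later planning can reuse them. The following is a minimal sketch of that loop; every class and function name here (MultimodalMemory, plan_with_mlm, execute, run_task) is an illustrative assumption, not the paper's actual API, and the planner and controller are trivial stand-ins for the pre-trained models used in JARVIS-1.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class MultimodalMemory:
    """Stores (task, plan) experiences; retrieval is exact-match for brevity.
    The real system retrieves from multimodal (visual + textual) experience."""
    entries: dict = field(default_factory=dict)

    def retrieve(self, task: str) -> Optional[list]:
        return self.entries.get(task)

    def add(self, task: str, plan: list) -> None:
        self.entries[task] = plan

def plan_with_mlm(task: str, observation: str, retrieved: Optional[list]) -> list:
    """Stand-in for the multimodal language model planner: reuses a
    retrieved plan when one exists, otherwise emits a trivial plan."""
    if retrieved is not None:
        return retrieved
    return [f"sub-goal for '{task}' given '{observation}'"]

def execute(sub_goal: str) -> bool:
    """Stand-in for a goal-conditioned controller; always succeeds here."""
    return True

def run_task(task: str, observation: str, memory: MultimodalMemory) -> bool:
    plan = plan_with_mlm(task, observation, memory.retrieve(task))
    if all(execute(g) for g in plan):
        memory.add(task, plan)  # self-improvement: keep the successful plan
        return True
    return False

memory = MultimodalMemory()
assert run_task("chop trees", "forest biome", memory)
assert memory.retrieve("chop trees") is not None  # experience stored for reuse
```

The point of the sketch is the feedback path: writing successful plans into memory is what lets the agent's task completion improve as game time progresses, per the life-long learning claim in the abstract.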
Pages: 1894-1907
Page count: 14
Related Papers
6 records
  • [1] JARVIS-1: Open-world Multi-task Agents with Memory-Augmented Multimodal Language Models
    Wang, Zihao
    Cai, Shaofei
    Liu, Anji
    Jin, Yonggang
    Hou, Jinbing
    Zhang, Bowei
    Lin, Haowei
    He, Zhaofeng
    Zheng, Zilong
    Yang, Yaodong
    Ma, Xiaojian
    Liang, Yitao
    arXiv, 2023
  • [2] Describe, Explain, Plan and Select: Interactive Planning with Large Language Models Enables Open-World Multi-Task Agents
    Wang, Zihao
    Cai, Shaofei
    Chen, Guanzhou
    Liu, Anji
    Ma, Xiaojian
    Liang, Yitao
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023,
  • [3] Open-Ended Instructable Embodied Agents with Memory-Augmented Large Language Models
    Sarch, Gabriel
    Wu, Yue
    Tarr, Michael J.
    Fragkiadaki, Katerina
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS - EMNLP 2023, 2023, : 3468 - 3500
  • [4] SIM: Open-World Multi-Task Stream Classifier with Integral Similarity Metrics
    Gao, Yang
    Li, Yi-Fan
    Dong, Bo
    Lin, Yu
    Khan, Latifur
    2019 IEEE INTERNATIONAL CONFERENCE ON BIG DATA (BIG DATA), 2019, : 751 - 760
  • [5] Open-World Multi-Task Control Through Goal-Aware Representation Learning and Adaptive Horizon Prediction
    Cai, Shaofei
    Wang, Zihao
    Ma, Xiaojian
    Liu, Anji
    Liang, Yitao
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023, : 13734 - 13744
  • [6] RAMIE: retrieval-augmented multi-task information extraction with large language models on dietary supplements
    Zhan, Zaifu
    Zhou, Shuang
    Li, Mingchen
    Zhang, Rui
    JOURNAL OF THE AMERICAN MEDICAL INFORMATICS ASSOCIATION, 2025, 32 (03) : 545 - 554