Open-World Object Manipulation using Pre-Trained Vision-Language Models

Citations: 0
Authors
Stone, Austin [1]
Xiao, Ted [1]
Lu, Yao [1]
Gopalakrishnan, Keerthana [1]
Lee, Kuang-Huei [1]
Vuong, Quan [1]
Wohlhart, Paul [1]
Kirmani, Sean [1]
Zitkovich, Brianna [1]
Xia, Fei [1]
Finn, Chelsea [1]
Hausman, Karol [1]
Affiliations
[1] Robotics at Google, Mountain View, CA 94043 USA
Source
Keywords
DOI
Not available
CLC classification
TP18 [Theory of Artificial Intelligence]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
For robots to follow instructions from people, they must be able to connect the rich semantic information in human vocabulary, e.g. "can you get me the pink stuffed whale?", to their sensory observations and actions. This raises a notably difficult challenge for robots: while robot learning approaches allow robots to learn many different behaviors from first-hand experience, it is impractical for robots to have first-hand experiences spanning all of this semantic information. We would like a robot's policy to be able to perceive and pick up the pink stuffed whale, even if it has never seen any data interacting with a stuffed whale before. Fortunately, static data on the internet contains vast semantic information, and this information is captured in pre-trained vision-language models. In this paper, we study whether we can interface robot policies with these pre-trained models, with the aim of allowing robots to complete instructions involving object categories that the robot has never seen first-hand. We develop a simple approach, which we call Manipulation of Open-World Objects (MOO), which leverages a pre-trained vision-language model to extract object-identifying information from the language command and image, and conditions the robot policy on the current image, the instruction, and the extracted object information. In a variety of experiments on a real mobile manipulator, we find that MOO generalizes zero-shot to a wide range of novel object categories and environments. In addition, we show how MOO generalizes to other, non-language-based modalities for specifying the object of interest, such as finger pointing, and how it can be further extended to enable open-world navigation and manipulation. The project's website and evaluation videos can be found at https://robot-moo.github.io/
Pages: 21
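
The abstract above outlines MOO's interface: a pre-trained vision-language model localizes the object named in the instruction, and the robot policy is then conditioned on the current image, the instruction, and that extracted object information. Below is a minimal sketch of this pattern, not the paper's implementation: it assumes OWL-ViT (via the Hugging Face transformers library) as the open-vocabulary detector standing in for the pre-trained vision-language model, and a hypothetical policy function that consumes the image, the instruction, and a single-pixel object-location channel.

# Minimal sketch of the interface described in the abstract. Assumptions not
# taken from the record above: OWL-ViT (via Hugging Face `transformers`) stands
# in for the pre-trained vision-language model, and `policy` is a hypothetical
# placeholder for the learned manipulation policy.
import numpy as np
import torch
from PIL import Image
from transformers import OwlViTProcessor, OwlViTForObjectDetection

processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
detector = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")

def localize_object(image: Image.Image, object_phrase: str) -> tuple[float, float]:
    # Query the open-vocabulary detector with the object phrase and return the
    # (x, y) centroid of the highest-scoring box.
    inputs = processor(text=[[object_phrase]], images=image, return_tensors="pt")
    with torch.no_grad():
        outputs = detector(**inputs)
    target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
    result = processor.post_process_object_detection(
        outputs, threshold=0.05, target_sizes=target_sizes)[0]
    if result["scores"].numel() == 0:
        raise ValueError(f"no detection for {object_phrase!r}")
    best = result["scores"].argmax()
    x0, y0, x1, y1 = result["boxes"][best].tolist()
    return (x0 + x1) / 2.0, (y0 + y1) / 2.0

def object_channel(image: Image.Image, centroid: tuple[float, float]) -> np.ndarray:
    # Encode the object location as a single "hot" pixel in an extra image
    # channel, so the policy never needs the category name itself.
    mask = np.zeros((image.height, image.width), dtype=np.float32)
    x = int(np.clip(round(centroid[0]), 0, image.width - 1))
    y = int(np.clip(round(centroid[1]), 0, image.height - 1))
    mask[y, x] = 1.0
    return mask

# Hypothetical usage -- `policy` is not defined here:
# image = Image.open("observation.png").convert("RGB")
# centroid = localize_object(image, "pink stuffed whale")
# action = policy(image, "pick up the pink stuffed whale",
#                 object_channel(image, centroid))

Representing the target purely by its location keeps the policy agnostic to category names, which is what lets instructions about never-before-seen object categories be executed in this sketch's framing.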
    [J]. PROCEEDINGS OF THE 32ND ACM INTERNATIONAL CONFERENCE ON INFORMATION AND KNOWLEDGE MANAGEMENT, CIKM 2023, 2023, : 1308 - 1317