Open-World Object Manipulation using Pre-Trained Vision-Language Models

Cited by: 0
Authors
Stone, Austin [1 ]
Xiao, Ted [1 ]
Lu, Yao [1 ]
Gopalakrishnan, Keerthana [1 ]
Lee, Kuang-Huei [1 ]
Quan Vuong [1 ]
Wohlhart, Paul [1 ]
Kirmani, Sean [1 ]
Zitkovich, Brianna [1 ]
Xia, Fei [1 ]
Finn, Chelsea [1 ]
Hausman, Karol [1 ]
Affiliations
[1] Google, Robotics, Mountain View, CA 94043 USA
Source
Keywords
DOI
Not available
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
For robots to follow instructions from people, they must be able to connect the rich semantic information in human vocabulary, e.g., "can you get me the pink stuffed whale?", to their sensory observations and actions. This brings up a notably difficult challenge for robots: while robot learning approaches allow robots to learn many different behaviors from first-hand experience, it is impractical for robots to have first-hand experiences that span all of this semantic information. We would like a robot's policy to be able to perceive and pick up the pink stuffed whale, even if it has never seen any data interacting with a stuffed whale before. Fortunately, static data on the internet has vast semantic information, and this information is captured in pre-trained vision-language models. In this paper, we study whether we can interface robot policies with these pre-trained models, with the aim of allowing robots to complete instructions involving object categories that the robot has never seen first-hand. We develop a simple approach, which we call Manipulation of Open-World Objects (MOO), which leverages a pre-trained vision-language model to extract object-identifying information from the language command and image, and conditions the robot policy on the current image, the instruction, and the extracted object information. In a variety of experiments on a real mobile manipulator, we find that MOO generalizes zero-shot to a wide range of novel object categories and environments. In addition, we show how MOO generalizes to other, non-language-based input modalities to specify the object of interest, such as finger pointing, and how it can be further extended to enable open-world navigation and manipulation. The project's website and evaluation videos can be found at https://robot-moo.github.io/
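The abstract describes the MOO pipeline only at a high level: a pre-trained vision-language model localizes the object named in the instruction, and the policy is conditioned on the image, the instruction, and that localization rather than on object categories. The following is a minimal, hypothetical Python sketch of that data flow, assuming a generic open-vocabulary detector interface; all names (extract_object_phrase, detect_with_vlm, policy_step, ObjectLocalization) are illustrative stand-ins and not the authors' implementation.

```python
# Hypothetical sketch of a MOO-style pipeline (not the paper's code):
# a pre-trained vision-language model grounds the object phrase from the
# command, and the policy consumes image + instruction + localization.

from dataclasses import dataclass
from typing import Tuple

import numpy as np


@dataclass
class ObjectLocalization:
    """Object-identifying information from the VLM (assumed here: a 2D point)."""
    center_xy: Tuple[int, int]
    score: float


def extract_object_phrase(instruction: str) -> str:
    """Pull the object phrase out of a templated 'pick X' command (assumed format)."""
    return instruction.split(" ", 1)[1] if instruction.startswith("pick ") else instruction


def detect_with_vlm(image: np.ndarray, object_phrase: str) -> ObjectLocalization:
    """Stand-in for querying an open-vocabulary detector with the object phrase.

    A real system would call a pre-trained vision-language detector here; this
    stub just returns the image center so the example runs end to end.
    """
    h, w = image.shape[:2]
    return ObjectLocalization(center_xy=(w // 2, h // 2), score=1.0)


def policy_step(image: np.ndarray,
                instruction: str,
                localization: ObjectLocalization) -> np.ndarray:
    """Stand-in for the learned policy conditioned on image, instruction, and
    the extracted object localization; returns a dummy 7-DoF action."""
    del instruction  # a real policy would also embed the instruction
    x, y = localization.center_xy
    action = np.zeros(7, dtype=np.float32)
    # Toy use of the localization: encode the normalized object position.
    action[:2] = [x / image.shape[1], y / image.shape[0]]
    return action


if __name__ == "__main__":
    frame = np.zeros((256, 320, 3), dtype=np.uint8)  # placeholder camera frame
    command = "pick the pink stuffed whale"
    phrase = extract_object_phrase(command)
    loc = detect_with_vlm(frame, phrase)    # VLM grounds the novel object
    act = policy_step(frame, command, loc)  # policy never sees a category label
    print(phrase, loc, act)
```

The design point the sketch illustrates is the one stated in the abstract: because the policy receives a localization rather than a category label, a novel object only needs to be recognizable by the pre-trained vision-language model, not present in the robot's own interaction data.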
Pages: 21