ManipLLM: Embodied Multimodal Large Language Model for Object-Centric Robotic Manipulation

Cited by: 2

Authors:

Li, Xiaoqi [1]

Zhang, Mingxu [2]

Geng, Yiran [1]

Geng, Haoran [1]

Long, Yuxing [1]

Shen, Yan [1]

Zhang, Renrui [3]

Liu, Jiaming [1]

Dong, Hao [1]

Affiliations:

[1] Peking University, School of Computer Science, Beijing, People's Republic of China

[2] Beijing University of Posts and Telecommunications, Beijing, People's Republic of China

[3] CUHK, MMLab, Hong Kong, People's Republic of China

Source:

2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR) | 2024

Funding:

National Natural Science Foundation of China;

Keywords:

DOI:

10.1109/CVPR52733.2024.01710

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Robot manipulation relies on accurately predicting contact points and end-effector directions to ensure successful operation. However, learning-based robot manipulation, trained on a limited set of categories within a simulator, often struggles to generalize, especially when confronted with a wide range of categories. Therefore, we introduce an approach for robot manipulation that leverages the robust reasoning capabilities of Multimodal Large Language Models (MLLMs) to enhance the stability and generalization of manipulation. By fine-tuning the injected adapters, we preserve the inherent common sense and reasoning ability of the MLLMs while equipping them with the ability to manipulate objects. The fundamental insight lies in the introduced fine-tuning paradigm, which encompasses object category understanding, affordance prior reasoning, and object-centric pose prediction to stimulate the reasoning ability of the MLLM in manipulation. During inference, our approach uses an RGB image and a text prompt to predict the end-effector's pose in a chain-of-thought manner. After the initial contact is established, an active impedance adaptation policy is introduced to plan the upcoming waypoints in a closed-loop manner. Moreover, in the real world, we design a test-time adaptation (TTA) strategy for manipulation to enable the model to better adapt to the current real-world scene configuration. Experiments in simulation and in the real world show the promising performance of ManipLLM. More details and demonstrations can be found at https://sites.google.com/view/manipllm.

Pages: 18061-18070

Page count: 10
