ManipLLM: Embodied Multimodal Large Language Model for Object-Centric Robotic Manipulation

Cited by: 2
Authors
Li, Xiaoqi [1]; Zhang, Mingxu [2]; Geng, Yiran [1]; Geng, Haoran [1]; Long, Yuxing [1]; Shen, Yan [1]; Zhang, Renrui [3]; Liu, Jiaming [1]; Dong, Hao [1]
Affiliations
[1] Peking Univ, Sch Comp Sci, Beijing, Peoples R China
[2] Beijing Univ Posts & Telecommun, Beijing, Peoples R China
[3] CUHK, MMLab, Hong Kong, Peoples R China
Funding
National Natural Science Foundation of China;
DOI
10.1109/CVPR52733.2024.01710
CLC Classification
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104; 0812; 0835; 1405;
Abstract
Robot manipulation relies on accurately predicting contact points and end-effector directions to ensure successful operation. However, learning-based robot manipulation, trained on a limited set of object categories in a simulator, often struggles to generalize, especially when confronted with a wide range of categories. We therefore introduce an approach for robot manipulation that leverages the robust reasoning capabilities of Multimodal Large Language Models (MLLMs) to improve the stability and generalization of manipulation. By fine-tuning only the injected adapters, we preserve the inherent common sense and reasoning abilities of the MLLM while equipping it with manipulation skills. The fundamental insight lies in the introduced fine-tuning paradigm, which encompasses object category understanding, affordance prior reasoning, and object-centric pose prediction to stimulate the MLLM's reasoning ability for manipulation. During inference, our approach takes an RGB image and a text prompt and predicts the end effector's pose in a chain-of-thought manner. After the initial contact is established, an active impedance adaptation policy plans the upcoming waypoints in a closed-loop manner. Moreover, for the real world, we design a test-time adaptation (TTA) strategy for manipulation that enables the model to better adapt to the current real-world scene configuration. Experiments in simulation and the real world show the promising performance of ManipLLM. More details and demonstrations can be found at https://sites.google.com/view/manipllm.
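The staged chain-of-thought inference the abstract describes can be sketched in outline. This is a hypothetical illustration, not the authors' implementation: the prompt wording, the `build_cot_prompts`/`parse_pose` names, and the answer format are all assumptions; only the three-stage structure (category understanding, affordance reasoning, object-centric pose prediction) comes from the abstract.

```python
# Illustrative sketch of the three-stage chain-of-thought prompting
# described in the abstract. All names and formats here are hypothetical.
from dataclasses import dataclass


@dataclass
class EndEffectorPose:
    contact_xy: tuple        # 2D contact point in image coordinates
    gripper_up: tuple        # gripper up direction (3-vector)
    gripper_forward: tuple   # gripper forward direction (3-vector)


def build_cot_prompts(task: str) -> list:
    """Return the three-stage prompt chain applied to one RGB image."""
    return [
        "Stage 1 (category): What object is shown in the image?",
        "Stage 2 (affordance): Which region of the object can be "
        "manipulated to accomplish: " + task + "?",
        "Stage 3 (pose): Predict the contact point (x, y) and the "
        "gripper's up and forward directions for that region.",
    ]


def parse_pose(answer: str) -> EndEffectorPose:
    """Parse an 'x y ux uy uz fx fy fz' style model answer (assumed format)."""
    v = [float(t) for t in answer.split()]
    return EndEffectorPose((v[0], v[1]), tuple(v[2:5]), tuple(v[5:8]))
```

Each stage's answer would be fed back into the next stage's context, so the final pose prediction is conditioned on the model's own category and affordance reasoning; the closed-loop impedance adaptation and TTA steps follow after this initial contact pose.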
Pages: 18061-18070 (10 pages)
Related Papers (50 in total)
  • [41] Robotic In-Hand Manipulation for Large-Range Precise Object Movement: The RGMC Champion Solution
    Yu, Mingrui; Jiang, Yongpeng; Chen, Chen; Jia, Yongyi; Li, Xiang
    IEEE ROBOTICS AND AUTOMATION LETTERS, 2025, 10 (05): 4738-4745
  • [42] LINGUA MANGA: A Generic Large Language Model Centric System for Data Curation
    Chen, Zui; Cao, Lei; Madden, Sam
    PROCEEDINGS OF THE VLDB ENDOWMENT, 2023, 16 (12): 4074-4077
  • [43] DIM: Dynamic Integration of Multimodal Entity Linking with Large Language Model
    Song, Shezheng; Li, Shasha; Yu, Jie; Zhao, Shan; Li, Xiaopeng; Ma, Jun; Liu, Xiaodong; Li, Zhuo; Mao, Xiaoguang
    PATTERN RECOGNITION AND COMPUTER VISION, PT V, PRCV 2024, 2025, 15035: 187-200
  • [44] GENIXER: Empowering Multimodal Large Language Model as a Powerful Data Generator
    Zhao, Henry Hengyuan; Zhou, Pan; Shou, Mike Zheng
    COMPUTER VISION - ECCV 2024, PT XXIII, 2025, 15081: 129-147
  • [45] EarthMarker: A Visual Prompting Multimodal Large Language Model for Remote Sensing
    Zhang, Wei; Cai, Miaoxin; Zhang, Tong; Zhuang, Yin; Li, Jun; Mao, Xuerui
    IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, 2025, 63
  • [46] Multimodal Emotion Captioning Using Large Language Model with Prompt Engineering
    Xu, Yaoxun; Zhou, Yixuan; Cai, Yunrui; Xie, Jingran; Ye, Runchuan; Wu, Zhiyong
    PROCEEDINGS OF THE 2ND INTERNATIONAL WORKSHOP ON MULTIMODAL AND RESPONSIBLE AFFECTIVE COMPUTING, MRAC 2024, 2024: 104-109
  • [47] Accurate Robotic Pushing Manipulation Through Online Model Estimation Under Uncertain Object Properties
    Lee, Yongseok; Kim, Keehoon
    IEEE ROBOTICS AND AUTOMATION LETTERS, 2024, 9 (10): 8730-8737
  • [48] Robotic Object Manipulation with Multilevel Part-based Model in RGB-D Data
    Li, Kun; Meng, Max
    2014 IEEE INTERNATIONAL CONFERENCE ON ROBOTICS AND AUTOMATION (ICRA), 2014: 3151-3156
  • [49] Potato disease detection and prevention using multimodal AI and large language model
    Zhu, Hongfei; Shi, Weiming; Guo, Xinyu; Lyu, Shiting; Yang, Ranbing; Han, Zhongzhi
    COMPUTERS AND ELECTRONICS IN AGRICULTURE, 2025, 229
  • [50] Comparative Analysis of Multimodal Large Language Model Performance on Clinical Vignette Questions
    Han, Tianyu; Adams, Lisa C.; Bressem, Keno K.; Busch, Felix; Nebelung, Sven; Truhn, Daniel
    JAMA-JOURNAL OF THE AMERICAN MEDICAL ASSOCIATION, 2024, 331 (15): 1320-1321