ManipLLM: Embodied Multimodal Large Language Model for Object-Centric Robotic Manipulation

Citations: 2
Authors
Li, Xiaoqi [1 ]
Zhang, Mingxu [2 ]
Geng, Yiran [1 ]
Geng, Haoran [1 ]
Long, Yuxing [1 ]
Shen, Yan [1 ]
Zhang, Renrui [3 ]
Liu, Jiaming [1 ]
Dong, Hao [1 ]
Affiliations
[1] Peking Univ, Sch Comp Sci, Beijing, Peoples R China
[2] Beijing Univ Posts & Telecommun, Beijing, Peoples R China
[3] CUHK, MMLab, Hong Kong, Peoples R China
Funding
National Natural Science Foundation of China;
DOI
10.1109/CVPR52733.2024.01710
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Robot manipulation relies on accurately predicting contact points and end-effector directions to ensure successful operation. However, learning-based robot manipulation, trained on a limited set of categories within a simulator, often struggles to generalize, especially when confronted with a wide range of categories. We therefore introduce an approach to robot manipulation that leverages the robust reasoning capabilities of Multimodal Large Language Models (MLLMs) to enhance the stability and generalization of manipulation. By fine-tuning only the injected adapters, we preserve the inherent common sense and reasoning ability of the MLLM while equipping it with the ability to manipulate. The fundamental insight lies in the introduced fine-tuning paradigm, which encompasses object category understanding, affordance prior reasoning, and object-centric pose prediction to stimulate the MLLM's reasoning ability in manipulation. During inference, our approach takes an RGB image and a text prompt and predicts the end effector's pose in a chain-of-thought manner. After initial contact is established, an active impedance adaptation policy plans the upcoming waypoints in a closed-loop manner. Moreover, in the real world, we design a test-time adaptation (TTA) strategy for manipulation that enables the model to better adapt to the current real-world scene configuration. Experiments in simulation and the real world show the promising performance of ManipLLM. More details and demonstrations can be found at https://sites.google.com/view/manipllm.
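The abstract describes the inference pipeline only in prose, so the Python sketch below illustrates one way the staged chain-of-thought querying could be wired up. It is a minimal sketch under stated assumptions, not the authors' implementation: the MLLM interface, the prompt wording, and parse_pose are all hypothetical, and the closed-loop waypoint planning (active impedance adaptation) and test-time adaptation stages are omitted.

```python
import re
from dataclasses import dataclass
from typing import Protocol, Sequence


class MLLM(Protocol):
    """Any multimodal LLM exposing image-conditioned text generation (hypothetical interface)."""
    def generate(self, image, prompt: str,
                 history: Sequence[tuple[str, str]]) -> str: ...


@dataclass
class EndEffectorPose:
    contact_uv: tuple[int, int]            # contact point in pixel coordinates
    direction: tuple[float, float, float]  # gripper approach direction (unit vector)


# Staged prompts mirroring the fine-tuning paradigm named in the abstract:
# category understanding -> affordance prior reasoning -> pose prediction.
# The exact wording here is an assumption, not the paper's prompts.
COT_PROMPTS = [
    "What is the category of the object in the image?",
    "Which part of the object should be interacted with to {task}?",
    "Predict the contact point pixel (u, v) and the gripper approach "
    "direction (dx, dy, dz) to {task}.",
]


def parse_pose(answer: str) -> EndEffectorPose:
    # Hypothetical parser: pull the first five numbers out of the final answer.
    nums = [float(x) for x in re.findall(r"-?\d+(?:\.\d+)?", answer)]
    u, v, dx, dy, dz = nums[:5]
    return EndEffectorPose(contact_uv=(int(u), int(v)), direction=(dx, dy, dz))


def predict_pose(mllm: MLLM, rgb_image, task: str) -> EndEffectorPose:
    # Ask the staged questions in order so each answer conditions the next,
    # i.e. the pose is predicted "in a chain-of-thought manner".
    history: list[tuple[str, str]] = []
    for template in COT_PROMPTS:
        prompt = template.format(task=task)
        answer = mllm.generate(image=rgb_image, prompt=prompt, history=history)
        history.append((prompt, answer))
    return parse_pose(history[-1][1])
```

Per the abstract, only this initial contact pose would come from the MLLM; once contact is established, subsequent waypoints are planned closed-loop by the active impedance adaptation policy rather than by further prompting.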
Pages: 18061 - 18070
Number of pages: 10
Related Papers
50 items in total
  • [31] Multimodal Diffusion Segmentation Model for Object Segmentation from Manipulation Instructions
    Iioka, Yui
    Yoshida, Yu
    Wada, Yuiga
    Hatanaka, Shumpei
    Sugiura, Komei
    2023 IEEE/RSJ INTERNATIONAL CONFERENCE ON INTELLIGENT ROBOTS AND SYSTEMS (IROS), 2023: 7590 - 7597
  • [32] A medical multimodal large language model for future pandemics
    Liu, Fenglin
    Zhu, Tingting
    Wu, Xian
    Yang, Bang
    You, Chenyu
    Wang, Chenyang
    Lu, Lei
    Liu, Zhangdaihong
    Zheng, Yefeng
    Sun, Xu
    Yang, Yang
    Clifton, Lei
    Clifton, David A.
    NPJ DIGITAL MEDICINE, 2023, 6 (01)
  • [34] Static elastic model of a hemispherical soft fingertip for object manipulation by robotic hand
    Inoue, Takahiro
    Hirai, Shinichi
    Nihon Kikai Gakkai Ronbunshu, C Hen/Transactions of the Japan Society of Mechanical Engineers, Part C, 2006, 72 (03): 872 - 878
  • [35] Large language model to multimodal large language model: A journey to shape the biological macromolecules to biological sciences and medicine
    Bhattacharya, Manojit
    Pal, Soumen
    Chatterjee, Srijan
    Lee, Sang-Soo
    Chakraborty, Chiranjib
    MOLECULAR THERAPY NUCLEIC ACIDS, 2024, 35 (03)
  • [36] LLMGA: Multimodal Large Language Model Based Generation Assistant
    Xia, Bin
    Wang, Shiyin
    Tao, Yingfan
    Wang, Yitong
    Jia, Jiaya
    COMPUTER VISION-ECCV 2024, PT XXXVIII, 2025, 15096 : 389 - 406
  • [37] Multimodal Speech Emotion Recognition Based on Large Language Model
    Fang, Congcong
    Jin, Yun
    Chen, Guanlin
    Zhang, Yunfan
    Li, Shidang
    Ma, Yong
    Xie, Yue
    IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS, 2024, E107D (11) : 1463 - 1467
  • [38] Strengthening Multimodal Large Language Model with Bootstrapped Preference Optimization
    Pi, Renjie
    Han, Tianyang
    Xiong, Wei
    Zhang, Jipeng
    Liu, Runtao
    Pan, Rui
    Zhang, Tong
    COMPUTER VISION - ECCV 2024, PT XXXIII, 2025, 15091 : 382 - 398
  • [39] A Refer-and-Ground Multimodal Large Language Model for Biomedicine
    Huang, Xiaoshuang
    Huang, Haifeng
    Shen, Lingdong
    Yang, Yehui
    Shang, Fangxin
    Liu, Junwei
    Liu, Jia
    MEDICAL IMAGE COMPUTING AND COMPUTER ASSISTED INTERVENTION - MICCAI 2024, PT XII, 2024, 15012 : 399 - 409
  • [40] Soft Object Deformation Monitoring and Learning for Model-Based Robotic Hand Manipulation
    Cretu, Ana-Maria
    Payeur, Pierre
    Petriu, Emil M.
    IEEE TRANSACTIONS ON SYSTEMS MAN AND CYBERNETICS PART B-CYBERNETICS, 2012, 42 (03): : 740 - 753