3DMIT: 3D Multi-modal Instruction Tuning for Scene Understanding

Cited: 1
Authors
Li, Zeju [1 ]
Zhang, Chao [1 ]
Wang, Xiaoyan [1 ]
Ren, Ruilong [1 ]
Xu, Yifan [1 ]
Ma, Ruifei [1 ]
Liu, Xiangde [1 ]
Wei, Rong [1 ]
Affiliations
[1] Beijing Digital Nat Digital City Res Ctr, Beijing, Peoples R China
Keywords
3D-LLMs; 3D Scene Understanding;
DOI
10.1109/ICMEW63481.2024.10645462
CLC Number
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104; 0812; 0835; 1405;
Abstract
The remarkable potential of multi-modal large language models (MLLMs) in comprehending both vision and language information has been widely acknowledged. However, the scarcity of 3D scene-language pairs compared to their 2D counterparts, coupled with the inadequacy of existing approaches for enabling LLMs to understand 3D scenes, poses a significant challenge. In response, we collect and construct an extensive dataset of 75K instruction-response pairs tailored for 3D scenes, covering tasks in 3D VQA, grounding, and captioning. To further enhance the integration of 3D spatial information into LLMs, we introduce a novel and efficient prompt tuning paradigm, 3DMIT. This paradigm eliminates the alignment stage between 3D scenes and language and extends the instruction prompt with 3D modality information comprising the entire scene and segmented objects. We evaluate the effectiveness of our method across diverse tasks in the 3D scene domain and find that our approach serves as a strategic means of enriching LLMs' comprehension of the 3D world. Our code is available at https://github.com/staymylove/3DMIT.
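The abstract's core idea is that, rather than training a separate scene-language alignment stage, projected 3D features for the whole scene and each segmented object are spliced directly into the instruction prompt. The following is a minimal sketch of that prompt-construction pattern under stated assumptions: the placeholder tokens, `project_3d` projector, feature dimensions, and `build_prompt` helper are all illustrative inventions for this sketch, not the authors' actual API or architecture.

```python
import numpy as np

EMBED_DIM = 8            # toy LLM embedding width (assumption)
SCENE_TOKEN = "<scene>"  # placeholder marking where scene features are injected
OBJ_TOKEN = "<obj>"      # placeholder marking where per-object features are injected

rng = np.random.default_rng(0)
W = rng.normal(size=(16, EMBED_DIM))  # toy linear projector: 3D feature -> LLM embedding space

def project_3d(feat: np.ndarray) -> np.ndarray:
    """Map a raw 3D feature vector into the LLM's embedding space."""
    return feat @ W

def build_prompt(instruction: str, scene_feat: np.ndarray, object_feats: list):
    """Return (tokens, embeds): the text prompt with modality placeholders,
    plus the projected 3D embeddings to substitute at those positions.
    No alignment training stage is involved; features go straight into the prompt."""
    tokens = [SCENE_TOKEN] + [OBJ_TOKEN] * len(object_feats) + instruction.split()
    embeds = [project_3d(scene_feat)] + [project_3d(f) for f in object_feats]
    return tokens, embeds

# Toy scene: one global scene feature plus three segmented-object features.
scene = rng.normal(size=16)
objects = [rng.normal(size=16) for _ in range(3)]
tokens, embeds = build_prompt("How many chairs are in the room?", scene, objects)
print(tokens[:4])              # scene placeholder followed by three object placeholders
print(len(embeds), embeds[0].shape)
```

In a real system, each placeholder position in the tokenized prompt would be overwritten with the corresponding projected embedding before the sequence is fed to the frozen or fine-tuned LLM; only the small projector (and optionally LoRA-style adapters) would be trained.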
Pages: 5
Related Papers
50 records
  • [1] Incremental Dense Multi-modal 3D Scene Reconstruction
    Miksik, Ondrej
    Amar, Yousef
    Vineet, Vibhav
    Perez, Patrick
    Torr, Philip H. S.
    2015 IEEE/RSJ INTERNATIONAL CONFERENCE ON INTELLIGENT ROBOTS AND SYSTEMS (IROS), 2015, : 908 - 915
  • [2] A scene representation based on multi-modal 2D and 3D features
    Baseski, Emre
    Pugeault, Nicolas
    Kalkan, Sinan
    Kraft, Dirk
    Woergoetter, Florentin
    Krueger, Norbert
    2007 IEEE 11TH INTERNATIONAL CONFERENCE ON COMPUTER VISION, VOLS 1-6, 2007, : 63 - +
  • [3] TAMM: TriAdapter Multi-Modal Learning for 3D Shape Understanding
    Zhang, Zhihao
    Cao, Shengcao
    Wang, Yu-Xiong
    2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2024, : 21413 - 21423
  • [4] Multi-Modal Multi-Task Joint 2D and 3D Scene Perception and Localization
    Xu, Dan
    PROCEEDINGS OF THE 4TH INTERNATIONAL WORKSHOP ON HUMAN-CENTRIC MULTIMEDIA ANALYSIS, HCMA 2023, 2023, : 3 - 3
  • [5] OmniViewer: Multi-modal Monoscopic 3D DASH
    Gao, Zhenhuan
    Chen, Shannon
    Nahrstedt, Klara
    2015 IEEE INTERNATIONAL SYMPOSIUM ON MULTIMEDIA (ISM), 2015, : 449 - 452
  • [6] Multi-Modal Streaming 3D Object Detection
    Abdelfattah, Mazen
    Yuan, Kaiwen
    Wang, Z. Jane
    Ward, Rabab
    IEEE ROBOTICS AND AUTOMATION LETTERS, 2023, 8 (10) : 6163 - 6170
  • [7] Understanding multi-modal brain network data: An immersive 3D visualization approach
    Pester, B.
    Russig, B.
    Winke, O.
    Ligges, C.
    Dachselt, R.
    Gumhold, S.
    Computers and Graphics (Pergamon), 2022, 106 : 88 - 97
  • [8] Bi-stage multi-modal 3D instance segmentation method for production workshop scene
    Tang, Zaizuo
    Chen, Guangzhu
    Han, Yinhe
    Liao, Xiaojuan
    Ru, Qingjun
    Wu, Yuanyuan
    ENGINEERING APPLICATIONS OF ARTIFICIAL INTELLIGENCE, 2022, 112
  • [9] OmniViewer: Enabling Multi-modal 3D DASH
    Gao, Zhenhuan
    Chen, Shannon
    Nahrstedt, Klara
    MM'15: PROCEEDINGS OF THE 2015 ACM MULTIMEDIA CONFERENCE, 2015, : 801 - 802
  • [10] Multi-modal 3D Simulation Makes the Impossible Possible
    Ganske, Ingrid M.
    Schulz, Noah
    Livingston, Katie
    Goobie, Susan
    Meara, John G.
    Proctor, Mark
    Weinstock, Peter
    PLASTIC AND RECONSTRUCTIVE SURGERY-GLOBAL OPEN, 2018, 6 (04)