JM3D & JM3D-LLM: Elevating 3D Representation With Joint Multi-Modal Cues

Cited: 0
Authors
Ji, Jiayi [1 ,2 ]
Wang, Haowei [3 ]
Wu, Changli [1 ]
Ma, Yiwei [1 ]
Sun, Xiaoshuai [1 ]
Ji, Rongrong [1 ]
Affiliations
[1] Xiamen Univ, Key Lab Multimedia Trusted Percept & Efficient Com, Minist Educ China, Xiamen 361005, Peoples R China
[2] Natl Univ Singapore, Singapore 119077, Singapore
[3] Tencent, Youtu Lab, Shanghai 200000, Peoples R China
Funding
National Natural Science Foundation of China; National Key Research and Development Program of China; China Postdoctoral Science Foundation;
Keywords
Three-dimensional displays; Solid modeling; Point cloud compression; Visualization; Representation learning; Feature extraction; Large language models; Data models; Degradation; Contrastive learning; 3D representation learning; joint multi-modal alignment; large language model; structured multimodal organizer;
DOI
10.1109/TPAMI.2024.3523675
Chinese Library Classification (CLC)
TP18 [Theory of Artificial Intelligence];
Discipline Codes
081104; 0812; 0835; 1405;
Abstract
The rising importance of 3D representation learning, pivotal to computer vision, autonomous driving, and robotics, is evident. However, the prevailing trend of straightforwardly transferring 2D alignment strategies to the 3D domain encounters three distinct challenges: (1) Information Degradation: aligning 3D data with only single-view 2D images and generic category texts neglects the need for multi-view images and detailed subcategory texts. (2) Insufficient Synergy: aligning 3D representations to image and text features individually hampers the overall optimization of 3D models. (3) Underutilization: the fine-grained information inherent in the learned representations is often not fully exploited, indicating a potential loss of detail. To address these issues, we introduce JM3D, a comprehensive approach integrating point clouds, text, and images. Key contributions include the Structured Multimodal Organizer (SMO), which enriches the vision-language representation with multiple views and hierarchical text, and the Joint Multi-modal Alignment (JMA), which combines language understanding with visual representation. Our advanced model, JM3D-LLM, marries 3D representation with large language models via efficient fine-tuning. Evaluations on ModelNet40 and ScanObjectNN establish JM3D's superiority, and the strong performance of JM3D-LLM further underscores the effectiveness of our representation-transfer approach.
Pages: 2475-2492
Page count: 18
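The abstract's central technical idea, aligning the point-cloud embedding to a single joint target built from multi-view image features and hierarchical text features rather than to each modality separately, can be illustrated with a contrastive loss. The sketch below is not the authors' released implementation: the averaging-based fusion, the symmetric InfoNCE formulation, and all tensor shapes are illustrative assumptions.

```python
# Minimal sketch of a joint multi-modal alignment loss in the spirit of JMA.
# Multi-view image embeddings and hierarchical text embeddings are fused into
# one joint target per sample before a symmetric InfoNCE loss is applied.
import torch
import torch.nn.functional as F


def joint_alignment_loss(pc_emb, view_embs, text_embs, temperature=0.07):
    """pc_emb:    (B, D)    point-cloud embeddings
       view_embs: (B, V, D) multi-view image embeddings (SMO-style views)
       text_embs: (B, T, D) hierarchical text embeddings (category + subcategory)
    """
    # Fuse visual and textual cues into one joint target per sample
    # (simple averaging is an assumption, not the paper's exact rule).
    joint = view_embs.mean(dim=1) + text_embs.mean(dim=1)          # (B, D)

    pc_emb = F.normalize(pc_emb, dim=-1)
    joint = F.normalize(joint, dim=-1)

    logits = pc_emb @ joint.t() / temperature                      # (B, B)
    labels = torch.arange(pc_emb.size(0), device=pc_emb.device)

    # Symmetric InfoNCE: point cloud -> joint target and joint target -> point cloud.
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))


if __name__ == "__main__":
    B, V, T, D = 8, 4, 2, 512  # hypothetical batch, views, text levels, dim
    loss = joint_alignment_loss(torch.randn(B, D),
                                torch.randn(B, V, D),
                                torch.randn(B, T, D))
    print(loss.item())
```

Optimizing against the fused target, rather than against separate image and text losses, is what addresses the "Insufficient Synergy" issue the abstract describes, since the point-cloud encoder receives a single coherent training signal.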