JM3D & JM3D-LLM: Elevating 3D Representation With Joint Multi-Modal Cues

Cited: 0
Authors
Ji, Jiayi [1 ,2 ]
Wang, Haowei [3 ]
Wu, Changli [1 ]
Ma, Yiwei [1 ]
Sun, Xiaoshuai [1 ]
Ji, Rongrong [1 ]
Affiliations
[1] Xiamen Univ, Key Lab Multimedia Trusted Percept & Efficient Com, Minist Educ China, Xiamen 361005, Peoples R China
[2] Natl Univ Singapore, Singapore 119077, Singapore
[3] Tencent, Youtu Lab, Shanghai 200000, Peoples R China
Funding
National Natural Science Foundation of China; National Key Research and Development Program of China; China Postdoctoral Science Foundation;
Keywords
Three-dimensional displays; Solid modeling; Point cloud compression; Visualization; Representation learning; Feature extraction; Large language models; Data models; Degradation; Contrastive learning; 3D representation learning; joint multi-modal alignment; large language model; structured multimodal organizer;
DOI
10.1109/TPAMI.2024.3523675
Chinese Library Classification (CLC)
TP18 [Theory of Artificial Intelligence];
Discipline Codes
081104; 0812; 0835; 1405;
Abstract
The rising importance of 3D representation learning, pivotal to computer vision, autonomous driving, and robotics, is evident. However, the prevailing trend of straightforwardly transferring 2D alignment strategies to the 3D domain encounters three distinct challenges: (1) Information Degradation: aligning 3D data with only single-view 2D images and generic category texts neglects the need for multi-view images and detailed subcategory texts. (2) Insufficient Synergy: aligning 3D representations to image and text features individually hampers the overall optimization of 3D models. (3) Underutilization: the fine-grained information inherent in the learned representations is often not fully exploited, indicating a potential loss of detail. To address these issues, we introduce JM3D, a comprehensive approach integrating point clouds, text, and images. Key contributions include the Structured Multimodal Organizer (SMO), which enriches the vision-language representation with multiple views and hierarchical text, and the Joint Multi-modal Alignment (JMA), which combines language understanding with visual representation. Our advanced model, JM3D-LLM, marries 3D representation with large language models via efficient fine-tuning. Evaluations on ModelNet40 and ScanObjectNN establish JM3D's superiority, and the strong performance of JM3D-LLM further underscores the effectiveness of our representation-transfer approach.
Pages: 2475-2492
Page count: 18
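The abstract's central technical idea, aligning the point-cloud embedding to a single joint target built from multi-view image features and hierarchical text features rather than to each modality separately, can be illustrated with a contrastive loss. The sketch below is not the authors' released implementation: the averaging-based fusion, the symmetric InfoNCE formulation, and all tensor shapes are illustrative assumptions.

```python
# Minimal sketch of a joint multi-modal alignment loss in the spirit of JMA.
# Multi-view image embeddings and hierarchical text embeddings are fused into
# one joint target per sample before a symmetric InfoNCE loss is applied.
import torch
import torch.nn.functional as F


def joint_alignment_loss(pc_emb, view_embs, text_embs, temperature=0.07):
    """pc_emb:    (B, D)    point-cloud embeddings
       view_embs: (B, V, D) multi-view image embeddings (SMO-style views)
       text_embs: (B, T, D) hierarchical text embeddings (category + subcategory)
    """
    # Fuse visual and textual cues into one joint target per sample
    # (simple averaging is an assumption, not the paper's exact rule).
    joint = view_embs.mean(dim=1) + text_embs.mean(dim=1)          # (B, D)

    pc_emb = F.normalize(pc_emb, dim=-1)
    joint = F.normalize(joint, dim=-1)

    logits = pc_emb @ joint.t() / temperature                      # (B, B)
    labels = torch.arange(pc_emb.size(0), device=pc_emb.device)

    # Symmetric InfoNCE: point cloud -> joint target and joint target -> point cloud.
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))


if __name__ == "__main__":
    B, V, T, D = 8, 4, 2, 512  # hypothetical batch, views, text levels, dim
    loss = joint_alignment_loss(torch.randn(B, D),
                                torch.randn(B, V, D),
                                torch.randn(B, T, D))
    print(loss.item())
```

Optimizing against the fused target, rather than against separate image and text losses, is what addresses the "Insufficient Synergy" issue the abstract describes, since the point-cloud encoder receives a single coherent training signal.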