JM3D & JM3D-LLM: Elevating 3D Representation With Joint Multi-Modal Cues

Cited by: 0
Authors
Ji, Jiayi [1 ,2 ]
Wang, Haowei [3 ]
Wu, Changli [1 ]
Ma, Yiwei [1 ]
Sun, Xiaoshuai [1 ]
Ji, Rongrong [1 ]
Affiliations
[1] Xiamen Univ, Key Lab Multimedia Trusted Percept & Efficient Com, Minist Educ China, Xiamen 361005, Peoples R China
[2] Natl Univ Singapore, Singapore 119077, Singapore
[3] Tencent, Youtu Lab, Shanghai 200000, Peoples R China
Funding
National Natural Science Foundation of China; National Key Research and Development Program of China; China Postdoctoral Science Foundation;
Keywords
Three-dimensional displays; Solid modeling; Point cloud compression; Visualization; Representation learning; Feature extraction; Large language models; Data models; Degradation; Contrastive learning; 3D representation learning; joint multi-modal alignment; large language model; structured multimodal organizer;
DOI
10.1109/TPAMI.2024.3523675
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
The rising importance of 3D representation learning, pivotal in computer vision, autonomous driving, and robotics, is evident. However, a prevailing trend that straightforwardly transfers 2D alignment strategies to the 3D domain encounters three distinct challenges: (1) Information Degradation: aligning 3D data with only single-view 2D images and generic texts neglects the need for multi-view images and detailed subcategory texts. (2) Insufficient Synergy: these strategies align 3D representations to image and text features individually, hampering the overall optimization of 3D models. (3) Underutilization: the fine-grained information inherent in the learned representations is often not fully exploited, indicating a potential loss of detail. To address these issues, we introduce JM3D, a comprehensive approach integrating point clouds, text, and images. Key contributions include the Structured Multimodal Organizer (SMO), which enriches the vision-language representation with multiple views and hierarchical text, and the Joint Multi-modal Alignment (JMA), which combines language understanding with visual representation. Our advanced model, JM3D-LLM, marries 3D representation with large language models via efficient fine-tuning. Evaluations on ModelNet40 and ScanObjectNN establish JM3D's superiority. The superior performance of JM3D-LLM further underscores the effectiveness of our representation-transfer approach.
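The abstract's Joint Multi-modal Alignment (JMA) aligns the point-cloud representation with a joint target built from multi-view images and hierarchical text, rather than with each modality separately. A minimal sketch of such a joint contrastive objective is shown below, assuming CLIP-style normalized embeddings; the function name, the equal-weight mean fusion, and the temperature value are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a joint multi-modal contrastive alignment loss.
# Assumption: each object comes with V rendered views (SMO's multi-view branch)
# and T hierarchical text descriptions; their features are fused into one
# joint target, and the point-cloud embedding is aligned to that target.
import torch
import torch.nn.functional as F

def joint_alignment_loss(point_emb, view_embs, text_embs, temperature=0.07):
    """point_emb: (B, D) point-cloud features.
    view_embs: (B, V, D) features of V rendered views per object.
    text_embs: (B, T, D) features of T hierarchical text descriptions per object.
    """
    # Fuse the visual and language cues of each object into one joint target
    # (equal-weight mean fusion is a simplifying assumption).
    joint = torch.cat([view_embs, text_embs], dim=1).mean(dim=1)  # (B, D)

    p = F.normalize(point_emb, dim=-1)
    j = F.normalize(joint, dim=-1)

    logits = p @ j.t() / temperature                              # (B, B)
    labels = torch.arange(p.size(0), device=p.device)

    # Symmetric InfoNCE: match each point cloud to its own joint target
    # and each joint target back to its point cloud.
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))
```

The point illustrated here is that a single fused target lets one contrastive loss optimize the 3D encoder against both modalities at once, which addresses the "insufficient synergy" of aligning to image and text features individually, as described in the abstract.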
Pages: 2475 - 2492
Number of pages: 18