JM3D & JM3D-LLM: Elevating 3D Representation With Joint Multi-Modal Cues

Cited by: 0
Authors
Ji, Jiayi [1 ,2 ]
Wang, Haowei [3 ]
Wu, Changli [1 ]
Ma, Yiwei [1 ]
Sun, Xiaoshuai [1 ]
Ji, Rongrong [1 ]
Affiliations
[1] Xiamen Univ, Key Lab Multimedia Trusted Percept & Efficient Com, Minist Educ China, Xiamen 361005, Peoples R China
[2] Natl Univ Singapore, Singapore 119077, Singapore
[3] Tencent, Youtu Lab, Shanghai 200000, Peoples R China
Funding
National Natural Science Foundation of China; National Key Research and Development Program of China; China Postdoctoral Science Foundation
Keywords
Three-dimensional displays; Solid modeling; Point cloud compression; Visualization; Representation learning; Feature extraction; Large language models; Data models; Degradation; Contrastive learning; 3D representation learning; joint multi-modal alignment; large language model; structured multimodal organizer;
DOI
10.1109/TPAMI.2024.3523675
CLC number
TP18 [Artificial Intelligence Theory]
Discipline codes
081104; 0812; 0835; 1405
Abstract
3D representation learning has become increasingly important in computer vision, autonomous driving, and robotics. However, the prevailing trend of directly transferring 2D alignment strategies to the 3D domain encounters three distinct challenges: (1) Information Degradation: This arises from aligning 3D data with mere single-view 2D images and generic texts, neglecting the need for multi-view images and detailed subcategory texts. (2) Insufficient Synergy: These strategies align 3D representations to image and text features individually, hampering the overall optimization of 3D models. (3) Underutilization: The fine-grained information inherent in the learned representations is often not fully exploited, indicating a potential loss of detail. To address these issues, we introduce JM3D, a comprehensive approach integrating point cloud, text, and image. Key contributions include the Structured Multimodal Organizer (SMO), which enriches vision-language representation with multiple views and hierarchical text, and the Joint Multi-modal Alignment (JMA), which combines language understanding with visual representation. Our advanced model, JM3D-LLM, marries 3D representation with large language models via efficient fine-tuning. Evaluations on ModelNet40 and ScanObjectNN establish JM3D's superiority. The superior performance of JM3D-LLM further underscores the effectiveness of our representation transfer approach.
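The "Insufficient Synergy" point above contrasts aligning a 3D feature to image and text features individually with aligning it to a joint image-text target. The record does not include the paper's actual formulation, so the following is only a minimal illustrative sketch of joint contrastive alignment under assumed inputs: the function names, the averaging of image and text embeddings into one joint target, and the symmetric InfoNCE-style loss are all assumptions for illustration, not the published JMA objective.

```python
import numpy as np

def normalize(x):
    # L2-normalize each feature row
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def logsumexp(x, axis):
    # Numerically stable log-sum-exp along one axis (keeps dims for broadcasting)
    m = x.max(axis=axis, keepdims=True)
    return m + np.log(np.exp(x - m).sum(axis=axis, keepdims=True))

def joint_contrastive_loss(pc_feat, img_feat, txt_feat, temperature=0.07):
    """Illustrative joint alignment: instead of two separate point-image and
    point-text losses, the point-cloud feature is contrasted against a single
    joint target built from both the image and the text embedding."""
    pc = normalize(pc_feat)
    # One simple way to couple the modalities: average the normalized
    # image and text embeddings, then re-normalize the result.
    joint = normalize(normalize(img_feat) + normalize(txt_feat))
    logits = pc @ joint.T / temperature        # (N, N) similarity matrix
    idx = np.arange(len(pc))                   # matching pairs on the diagonal
    # Symmetric InfoNCE-style cross-entropy over both directions
    lp_p2j = logits - logsumexp(logits, axis=1)
    lp_j2p = logits.T - logsumexp(logits.T, axis=1)
    return 0.5 * (-lp_p2j[idx, idx].mean() - lp_j2p[idx, idx].mean())
```

Because the 3D branch receives gradient from one coupled target rather than two independent ones, the image and text signals cannot pull the 3D representation in conflicting directions, which is the synergy the abstract refers to.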
Pages: 2475-2492
Page count: 18
Related Papers
50 records
  • [1] Beyond First Impressions: Integrating Joint Multi-modal Cues for Comprehensive 3D Representation
    Wang, Haowei
    Tang, Jiji
    Ji, Jiayi
    Sun, Xiaoshuai
    Zhang, Rongsheng
    Ma, Yiwei
    Zhao, Minda
    Li, Lincheng
    Zhao, Zeng
    Lv, Tangjie
    Ji, Rongrong
    PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 3403 - 3414
  • [2] A scene representation based on multi-modal 2D and 3D features
    Baseski, Emre
    Pugeault, Nicolas
    Kalkan, Sinan
    Kraft, Dirk
    Woergoetter, Florentin
    Krueger, Norbert
    2007 IEEE 11TH INTERNATIONAL CONFERENCE ON COMPUTER VISION, VOLS 1-6, 2007, : 63 - +
  • [3] Multi-modal Relation Distillation for Unified 3D Representation Learning
    Wang, Huiqun
    Bao, Yiping
    Pan, Panwang
    Li, Zeming
    Liu, Xiao
    Yang, Ruijie
    Huang, Di
    COMPUTER VISION - ECCV 2024, PT XXXIII, 2025, 15091 : 364 - 381
  • [4] MMJN: Multi-Modal Joint Networks for 3D Shape Recognition
    Nie, Weizhi
    Liang, Qi
    Liu, An-An
    Mao, Zhendong
    Li, Yangyang
    PROCEEDINGS OF THE 27TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA (MM'19), 2019, : 908 - 916
  • [5] OmniViewer: Multi-modal Monoscopic 3D DASH
    Gao, Zhenhuan
    Chen, Shannon
    Nahrstedt, Klara
    2015 IEEE INTERNATIONAL SYMPOSIUM ON MULTIMEDIA (ISM), 2015, : 449 - 452
  • [6] Multi-Modal Streaming 3D Object Detection
    Abdelfattah, Mazen
    Yuan, Kaiwen
    Wang, Z. Jane
    Ward, Rabab
    IEEE ROBOTICS AND AUTOMATION LETTERS, 2023, 8 (10) : 6163 - 6170
  • [7] Multi-Modal Multi-Task Joint 2D and 3D Scene Perception and Localization
    Xu, Dan
    PROCEEDINGS OF THE 4TH INTERNATIONAL WORKSHOP ON HUMAN-CENTRIC MULTIMEDIA ANALYSIS, HCMA 2023, 2023, : 3 - 3
  • [8] MSeg3D: Multi-modal 3D Semantic Segmentation for Autonomous Driving
    Li, Jiale
    Dai, Hang
    Han, Hao
    Ding, Yong
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023, : 21694 - 21704
  • [9] A survey of approaches and challenges in 3D and multi-modal 3D+2D face recognition
    Bowyer, KW
    Chang, K
    Flynn, P
    COMPUTER VISION AND IMAGE UNDERSTANDING, 2006, 101 (01) : 1 - 15
  • [10] Multi-modal 2D and 3D biometrics for face recognition
    Chang, KI
    Bowyer, KW
    Flynn, PJ
    IEEE INTERNATIONAL WORKSHOP ON ANALYSIS AND MODELING OF FACE AND GESTURES, 2003, : 187 - 194