JM3D & JM3D-LLM: Elevating 3D Representation With Joint Multi-Modal Cues

Times Cited: 0
Authors
Ji, Jiayi [1,2]
Wang, Haowei [3]
Wu, Changli [1]
Ma, Yiwei [1]
Sun, Xiaoshuai [1]
Ji, Rongrong [1]
Affiliations
[1] Xiamen Univ, Key Lab Multimedia Trusted Percept & Efficient Com, Minist Educ China, Xiamen 361005, Peoples R China
[2] Natl Univ Singapore, Singapore 119077, Singapore
[3] Tencent, Youtu Lab, Shanghai 200000, Peoples R China
Funding
National Natural Science Foundation of China; National Key Research and Development Program of China; China Postdoctoral Science Foundation
Keywords
Three-dimensional displays; Solid modeling; Point cloud compression; Visualization; Representation learning; Feature extraction; Large language models; Data models; Degradation; Contrastive learning; 3D representation learning; joint multi-modal alignment; large language model; structured multimodal organizer;
DOI
10.1109/TPAMI.2024.3523675
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
3D representation learning is increasingly important in computer vision, autonomous driving, and robotics. However, the prevailing approach of straightforwardly transferring 2D alignment strategies to the 3D domain encounters three distinct challenges: (1) Information Degradation: aligning 3D data with only single-view 2D images and generic texts neglects the need for multi-view images and detailed subcategory texts. (2) Insufficient Synergy: these strategies align 3D representations to image and text features individually, hampering the overall optimization of 3D models. (3) Underutilization: the fine-grained information inherent in the learned representations is often not fully exploited, indicating a potential loss of detail. To address these issues, we introduce JM3D, a comprehensive approach integrating point cloud, text, and image. Key contributions include the Structured Multimodal Organizer (SMO), which enriches the vision-language representation with multiple views and hierarchical text, and the Joint Multi-modal Alignment (JMA), which combines language understanding with visual representation. Our advanced model, JM3D-LLM, marries 3D representation with large language models via efficient fine-tuning. Evaluations on ModelNet40 and ScanObjectNN establish JM3D's superiority, and the strong performance of JM3D-LLM further underscores the effectiveness of our representation-transfer approach.
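The abstract's central idea, aligning 3D point-cloud features with a single joint image-text target built from multi-view renderings and hierarchical text rather than with each modality separately, can be sketched as follows. This is a minimal illustration only, not the authors' implementation: the equal-weight fusion, the mean pooling over views and text levels, and all dimensions and names below are assumptions.

```python
# Minimal sketch (not the authors' code) of joint multi-modal alignment:
# align 3D embeddings to a fused image-text target instead of aligning
# to image and text features separately.

import torch
import torch.nn.functional as F


def joint_target(image_feats: torch.Tensor, text_feats: torch.Tensor) -> torch.Tensor:
    """Fuse multi-view image features (B, V, D) and hierarchical text
    features (B, T, D) into one joint embedding per object (B, D)."""
    img = F.normalize(image_feats.mean(dim=1), dim=-1)   # average over V rendered views
    txt = F.normalize(text_feats.mean(dim=1), dim=-1)    # average over T text levels
    return F.normalize(img + txt, dim=-1)                 # equal-weight fusion (assumption)


def contrastive_alignment_loss(pc_feats: torch.Tensor,
                               joint_feats: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE-style loss between point-cloud embeddings and the
    joint image-text targets; matching pairs share the same batch index."""
    pc = F.normalize(pc_feats, dim=-1)
    logits = pc @ joint_feats.t() / temperature            # (B, B) similarity matrix
    labels = torch.arange(pc.size(0), device=pc.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))


if __name__ == "__main__":
    B, V, T, D = 8, 4, 2, 512        # batch, views, text levels, feature dim (illustrative)
    pc = torch.randn(B, D)           # stands in for point-cloud encoder output
    imgs = torch.randn(B, V, D)      # stands in for multi-view image embeddings
    txts = torch.randn(B, T, D)      # stands in for category + subcategory text embeddings
    loss = contrastive_alignment_loss(pc, joint_target(imgs, txts))
    print(f"joint alignment loss: {loss.item():.4f}")
```

By contrast, the 2D-style baseline criticized in the abstract would compute two independent contrastive losses against the image and text embeddings, which is the "insufficient synergy" that a single joint target is meant to avoid.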
Pages: 2475-2492
Page Count: 18