UniM2AE: Multi-modal Masked Autoencoders with Unified 3D Representation for 3D Perception in Autonomous Driving

Cited by: 0
Authors
Zou, Jian [1 ]
Huang, Tianyu [1 ]
Yang, Guanglei [1 ]
Guo, Zhenhua [2 ]
Luo, Tao [3 ]
Feng, Chun-Mei [3 ]
Zuo, Wangmeng [1 ]
Affiliations
[1] Harbin Inst Technol, Harbin, Peoples R China
[2] Tianyijiaotong Technol Ltd, Suzhou, Peoples R China
[3] A*STAR, Inst High Performance Comp (IHPC), Singapore, Singapore
Source
COMPUTER VISION - ECCV 2024, PT XXXII, 2025, 15090
Keywords
Unified representation; sensor fusion; masked autoencoders
DOI
10.1007/978-3-031-72670-5_17
Chinese Library Classification (CLC)
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Masked Autoencoders (MAE) play a pivotal role in learning potent representations, delivering outstanding results across various 3D perception tasks essential for autonomous driving. Real-world driving systems commonly deploy multiple sensors for comprehensive environment perception. Although integrating multi-modal features from these sensors can produce rich and powerful representations, MAE methods struggle with this integration because of the substantial disparity between the modalities. This work studies multi-modal Masked Autoencoders with a unified representation space for autonomous driving, aiming at a more efficient fusion of the two modalities. To marry the semantics inherent in images with the geometric detail of LiDAR point clouds, we propose UniM2AE, a simple yet powerful multi-modal self-supervised pre-training framework built on two designs. First, it projects the features from both modalities into a cohesive 3D volume space that extends the bird's eye view (BEV) with a height dimension; this extension allows precise representation of objects and reduces information loss when aligning multi-modal features. Second, a Multi-modal 3D Interactive Module (MMIM) enables efficient inter-modal interaction within the unified space. Extensive experiments on the nuScenes dataset attest to the efficacy of UniM2AE, improving 3D object detection and BEV map segmentation by 1.2% NDS and 6.5% mIoU, respectively. The code is available at https://github.com/hollow-503/UniM2AE.
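Since the abstract describes the design only in prose, the following minimal sketch illustrates the two ideas it emphasizes: fusing camera and LiDAR features in a shared 3D volume that keeps the height axis rather than collapsing to BEV, and masking that volume for MAE-style pre-training. Everything here (ToyVolumeFusion, random_voxel_mask, the channel sizes, the conv-based fusion) is a hypothetical stand-in assumed for illustration; the actual MMIM and pre-training pipeline are defined in the paper and the repository linked above.

# Hypothetical sketch; not the paper's actual MMIM implementation.
import torch
import torch.nn as nn


class ToyVolumeFusion(nn.Module):
    """Per-voxel fusion of two modality volumes on the same (Z, Y, X) grid."""

    def __init__(self, cam_ch: int, lidar_ch: int, out_ch: int) -> None:
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv3d(cam_ch + lidar_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm3d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, cam_vol: torch.Tensor, lidar_vol: torch.Tensor) -> torch.Tensor:
        # Both inputs: (B, C, Z, Y, X). Keeping the height axis Z preserves
        # information that a pure BEV representation would lose.
        return self.fuse(torch.cat([cam_vol, lidar_vol], dim=1))


def random_voxel_mask(vol: torch.Tensor, mask_ratio: float = 0.75) -> torch.Tensor:
    """Zero out a random fraction of voxels, as in MAE-style pre-training."""
    b, _, z, y, x = vol.shape
    keep = (torch.rand(b, 1, z, y, x, device=vol.device) > mask_ratio).float()
    return vol * keep


if __name__ == "__main__":
    fusion = ToyVolumeFusion(cam_ch=64, lidar_ch=64, out_ch=128)
    cam = torch.randn(2, 64, 8, 64, 64)     # camera features lifted to a 3D grid
    lidar = torch.randn(2, 64, 8, 64, 64)   # voxelized LiDAR features, same grid
    fused = fusion(random_voxel_mask(cam), random_voxel_mask(lidar))
    print(fused.shape)  # torch.Size([2, 128, 8, 64, 64])

The 3D convolution is only a placeholder for the paper's interaction module; the point of the sketch is that once both modalities live on the same voxel grid, fusion and masking reduce to simple per-voxel operations.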
Pages: 296 - 313
Page count: 18
Related papers
50 items total
  • [21] Multi-Modal Streaming 3D Object Detection
    Abdelfattah, Mazen
    Yuan, Kaiwen
    Wang, Z. Jane
    Ward, Rabab
    IEEE ROBOTICS AND AUTOMATION LETTERS, 2023, 8 (10) : 6163 - 6170
  • [22] HUM3DIL: Semi-supervised Multi-modal 3D Human Pose Estimation for Autonomous Driving
    Zanfir, Andrei
    Zanfir, Mihai
    Gorban, Alexander
    Ji, Jingwei
    Zhou, Yin
    Anguelov, Dragomir
    Sminchisescu, Cristian
    CONFERENCE ON ROBOT LEARNING, VOL 205, 2022, 205 : 1114 - 1124
  • [23] JM3D & JM3D-LLM: Elevating 3D Representation With Joint Multi-Modal Cues
    Ji, Jiayi
    Wang, Haowei
    Wu, Changli
    Ma, Yiwei
    Sun, Xiaoshuai
    Ji, Rongrong
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2025, 47 (04) : 2475 - 2492
  • [24] SAMR: Symmetric masked multimodal modeling for general multi-modal 3D motion retrieval
    Li, Yunhao
    Wu, Sijing
    Zhu, Yucheng
    Sun, Wei
    Zhang, Zhichao
    Song, Song
    Zhai, Guangtao
    DISPLAYS, 2025, 87
  • [25] A survey of approaches and challenges in 3D and multi-modal 3D+2D face recognition
    Bowyer, KW
    Chang, K
    Flynn, P
    COMPUTER VISION AND IMAGE UNDERSTANDING, 2006, 101 (01) : 1 - 15
  • [26] Enhancing 3D object detection through multi-modal fusion for cooperative perception
    Xia, Bin
    Zhou, Jun
    Kong, Fanyu
    You, Yuhe
    Yang, Jiarui
    Lin, Lin
    ALEXANDRIA ENGINEERING JOURNAL, 2024, 104 : 46 - 55
  • [27] RepVF: A Unified Vector Fields Representation for Multi-task 3D Perception
    Li, Chunliang
    Han, Wencheng
    Yin, Junbo
    Zhao, Sanyuan
    Shen, Jianbing
    COMPUTER VISION - ECCV 2024, PT XXXII, 2025, 15090 : 273 - 292
  • [28] Personalized FedM2former: An Innovative Approach Towards Federated Multi-Modal 3D Object Detection for Autonomous Driving
    Zhao, Liang
    Li, Xuan
    Jia, Xin
    Fu, Lulu
    PROCESSES, 2025, 13 (02)
  • [29] Omni Viewer: Enabling Multi-modal 3D DASH
    Gao, Zhenhuan
    Chen, Shannon
    Nahrstedt, Klara
    MM'15: PROCEEDINGS OF THE 2015 ACM MULTIMEDIA CONFERENCE, 2015, : 801 - 802
  • [30] Multi-modal 3D Simulation Makes the Impossible Possible
    Ganske, Ingrid M.
    Schulz, Noah
    Livingston, Katie
    Goobie, Susan
    Meara, John G.
    Proctor, Mark
    Weinstock, Peter
    PLASTIC AND RECONSTRUCTIVE SURGERY-GLOBAL OPEN, 2018, 6 (04)