UniM²AE: Multi-modal Masked Autoencoders with Unified 3D Representation for 3D Perception in Autonomous Driving

Times cited: 0
Authors
Zou, Jian [1 ]
Huang, Tianyu [1 ]
Yang, Guanglei [1 ]
Guo, Zhenhua [2 ]
Luo, Tao [3 ]
Feng, Chun-Mei [3 ]
Zuo, Wangmeng [1 ]
Affiliations
[1] Harbin Inst Technol, Harbin, Peoples R China
[2] Tianyijiaotong Technol Ltd, Suzhou, Peoples R China
[3] ASTAR, Inst High Performance Comp IHPC, Singapore, Singapore
Source
Keywords
Unified representation; sensor fusion; masked autoencoders
DOI
10.1007/978-3-031-72670-5_17
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Masked Autoencoders (MAE) play a pivotal role in learning potent representations, delivering outstanding results across various 3D perception tasks essential for autonomous driving. In real-world driving scenarios, it is common to deploy multiple sensors for comprehensive environment perception. Although integrating multi-modal features from these sensors can produce rich and powerful representations, MAE methods struggle with this integration due to the substantial disparity between the modalities. This work investigates multi-modal Masked Autoencoders tailored for a unified representation space in autonomous driving, aiming toward a more efficient fusion of the two distinct modalities. To marry the semantics inherent in images with the geometric intricacies of LiDAR point clouds, we propose UniM²AE, a potent yet straightforward multi-modal self-supervised pre-training framework built on two main designs. First, it projects the features from both modalities into a cohesive 3D volume space that extends the bird's eye view (BEV) with the height dimension. This extension allows for a precise representation of objects and reduces information loss when aligning multi-modal features. Second, the Multi-modal 3D Interactive Module (MMIM) is introduced to facilitate efficient inter-modal interaction. Extensive experiments conducted on the nuScenes dataset attest to the efficacy of UniM²AE, improving 3D object detection and BEV map segmentation by 1.2% NDS and 6.5% mIoU, respectively. The code is available at https://github.com/hollow-503/UniM2AE.
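The following is a minimal, hypothetical PyTorch sketch (not the authors' released code) of the core idea in the abstract: camera and LiDAR features, assumed to already be lifted into a shared 3D volume spanning the BEV plane plus a height axis, are fused by a simple interaction module and collapsed to a BEV map for downstream heads. All class and parameter names are illustrative only.

import torch
import torch.nn as nn

class UnifiedVolumeFusion(nn.Module):
    # Toy stand-in for fusing camera and LiDAR features in a shared 3D volume
    # (BEV plane x height). Both inputs are assumed to have shape (B, C, Z, Y, X);
    # the lifting steps (depth-based view transform for images, voxelization for
    # LiDAR points) are omitted.
    def __init__(self, channels: int = 32):
        super().__init__()
        # Simplified interaction: concatenate the two volumes and mix them with
        # 3D convolutions (a crude proxy for the paper's MMIM, not its design).
        self.interact = nn.Sequential(
            nn.Conv3d(2 * channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, cam_vol: torch.Tensor, lidar_vol: torch.Tensor) -> torch.Tensor:
        fused = self.interact(torch.cat([cam_vol, lidar_vol], dim=1))
        # Collapse the height axis (Z) to obtain a BEV feature map for downstream
        # heads such as 3D detection or map segmentation.
        return fused.mean(dim=2)

if __name__ == "__main__":
    B, C, Z, Y, X = 1, 32, 8, 64, 64
    cam_vol = torch.randn(B, C, Z, Y, X)    # image features lifted into the volume
    lidar_vol = torch.randn(B, C, Z, Y, X)  # voxelized LiDAR features
    bev = UnifiedVolumeFusion(C)(cam_vol, lidar_vol)
    print(bev.shape)  # torch.Size([1, 32, 64, 64])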
Pages: 296 - 313
Page count: 18
Related papers
50 records in total
  • [41] MLF3D: Multi-Level Fusion for Multi-Modal 3D Object Detection
    Jiang, Han
    Wang, Jianbin
    Xiao, Jianru
    Zhao, Yanan
    Chen, Wanqing
    Ren, Yilong
    Yu, Haiyang
    2024 35TH IEEE INTELLIGENT VEHICLES SYMPOSIUM, IEEE IV 2024, 2024, : 1588 - 1593
  • [42] A unified representation for interactive 3D modeling
    Tubic, D
    Hébert, P
    Deschênes, JD
Laurendeau, D
    2ND INTERNATIONAL SYMPOSIUM ON 3D DATA PROCESSING, VISUALIZATION, AND TRANSMISSION, PROCEEDINGS, 2004, : 175 - 182
  • [43] Multi-Modal 3D Shape Clustering with Dual Contrastive Learning
    Lin, Guoting
    Zheng, Zexun
    Chen, Lin
    Qin, Tianyi
    Song, Jiahui
    APPLIED SCIENCES-BASEL, 2022, 12 (15):
  • [44] Quantization to accelerate inference in multi-modal 3D object detection
    Geerhart, Billy
    Dasari, Venkat R.
    Rapp, Brian
    Wang, Peng
    Wang, Ju
    Payne, Christopher X.
    DISRUPTIVE TECHNOLOGIES IN INFORMATION SCIENCES VIII, 2024, 13058
  • [45] Evaluation of 3D Feature Descriptors for Multi-modal Data Registration
    Kim, Hansung
    Hilton, Adrian
    2013 INTERNATIONAL CONFERENCE ON 3D VISION (3DV 2013), 2013, : 119 - 126
  • [46] 3D shape recognition based on multi-modal information fusion
    Qi Liang
    Mengmeng Xiao
    Dan Song
    Multimedia Tools and Applications, 2021, 80 : 16173 - 16184
  • [47] Learning Similarity Measure for Multi-Modal 3D Image Registration
    Lee, Daewon
    Hofmann, Matthias
    Steinke, Florian
    Altun, Yasemin
    Cahill, Nathan D.
    Schoelkopf, Bernhard
    CVPR: 2009 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, VOLS 1-4, 2009, : 186 - +
  • [48] MMJN: Multi-Modal Joint Networks for 3D Shape Recognition
    Nie, Weizhi
    Liang, Qi
    Liu, An-An
    Mao, Zhendong
    Li, Yangyang
    PROCEEDINGS OF THE 27TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA (MM'19), 2019, : 908 - 916
  • [49] Using multi-modal 3D contours and their relations for vision and robotics
    Baseski, Emre
    Pugeault, Nicolas
    Kalkan, Sinan
    Bodenhagen, Leon
    Piater, Justus H.
    Kruger, Norbert
    JOURNAL OF VISUAL COMMUNICATION AND IMAGE REPRESENTATION, 2010, 21 (08) : 850 - 864
  • [50] Multi-modal Panoramic 3D Outdoor Datasets for Place Categorization
    Jung, Hojung
    Oto, Yuki
    Mozos, Oscar M.
    Iwashita, Yumi
    Kurazume, Ryo
    2016 IEEE/RSJ INTERNATIONAL CONFERENCE ON INTELLIGENT ROBOTS AND SYSTEMS (IROS 2016), 2016, : 4545 - 4550