UniM2AE: Multi-modal Masked Autoencoders with Unified 3D Representation for 3D Perception in Autonomous Driving

Cited by: 0
Authors
Zou, Jian [1 ]
Huang, Tianyu [1 ]
Yang, Guanglei [1 ]
Guo, Zhenhua [2 ]
Luo, Tao [3 ]
Feng, Chun-Mei [3 ]
Zuo, Wangmeng [1 ]
Affiliations
[1] Harbin Inst Technol, Harbin, Peoples R China
[2] Tianyijiaotong Technol Ltd, Suzhou, Peoples R China
[3] ASTAR, Inst High Performance Comp IHPC, Singapore, Singapore
Keywords
Unified representation; sensor fusion; masked autoencoders
DOI
10.1007/978-3-031-72670-5_17
CLC Number
TP18 [Theory of Artificial Intelligence]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Masked Autoencoders (MAE) play a pivotal role in learning potent representations, delivering outstanding results across various 3D perception tasks essential for autonomous driving. In real-world driving scenarios, multiple sensors are commonly deployed for comprehensive environment perception. Although integrating multi-modal features from these sensors can produce rich and powerful representations, MAE methods struggle with this integration due to the substantial disparity between the modalities. This work investigates multi-modal Masked Autoencoders tailored to a unified representation space for autonomous driving, aiming at a more efficient fusion of two distinct modalities. To marry the semantics inherent in images with the geometric structure of LiDAR point clouds, we propose UniM2AE, a potent yet straightforward multi-modal self-supervised pre-training framework built around two designs. First, it projects the features from both modalities into a cohesive 3D volume space that extends the bird's eye view (BEV) with the height dimension. This extension allows for a precise representation of objects and reduces information loss when aligning multi-modal features. Second, the Multi-modal 3D Interactive Module (MMIM) is employed to facilitate efficient inter-modal interaction within this shared space. Extensive experiments conducted on the nuScenes dataset attest to the efficacy of UniM2AE, yielding improvements of 1.2% NDS in 3D object detection and 6.5% mIoU in BEV map segmentation. The code is available at https://github.com/hollow-503/UniM2AE.
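The abstract describes two mechanisms: lifting camera and LiDAR features into a shared 3D voxel volume (BEV extended with a height axis), and fusing the two modalities with the Multi-modal 3D Interactive Module (MMIM). The PyTorch sketch below illustrates the general idea only; the module names, tensor shapes, cross-attention design, and masking ratio are assumptions of ours, not the authors' actual implementation (see the linked repository for that).

```python
# Minimal sketch, assuming: (1) each modality yields per-point features with
# known voxel coordinates, and (2) MMIM-style fusion can be approximated by
# cross-attention over flattened voxel tokens. All names are illustrative.
import torch
import torch.nn as nn


class Unified3DVolume(nn.Module):
    """Scatter per-modality features into a shared (X, Y, Z) voxel grid."""

    def __init__(self, dim, grid):
        super().__init__()
        self.dim = dim
        self.grid = grid  # (X, Y, Z) voxel resolution, e.g. BEV plane x height

    def forward(self, feats, coords):
        # feats:  (N, dim) features from one modality (camera or LiDAR)
        # coords: (N, 3) integer voxel indices (x, y, z) for each feature
        X, Y, Z = self.grid
        volume = feats.new_zeros(X * Y * Z, self.dim)
        flat = coords[:, 0] * (Y * Z) + coords[:, 1] * Z + coords[:, 2]
        volume.index_add_(0, flat, feats)  # sum features landing in one voxel
        return volume.view(X, Y, Z, self.dim)


class InteractionModule(nn.Module):
    """Cross-attention between the two volumes (a stand-in for MMIM)."""

    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, lidar_vol, cam_vol):
        # Flatten voxels into token sequences; LiDAR tokens query camera tokens.
        q = lidar_vol.flatten(0, 2).unsqueeze(0)   # (1, X*Y*Z, dim)
        kv = cam_vol.flatten(0, 2).unsqueeze(0)    # (1, X*Y*Z, dim)
        fused, _ = self.attn(q, kv, kv)
        return self.norm(q + fused).squeeze(0)     # (X*Y*Z, dim) fused tokens


if __name__ == "__main__":
    dim, grid, n = 32, (8, 8, 4), 200
    lift = Unified3DVolume(dim, grid)

    # Random features and in-bounds voxel coordinates for each modality.
    coords = torch.stack([torch.randint(0, g, (n,)) for g in grid], dim=1)
    cam_vol = lift(torch.randn(n, dim), coords)
    lidar_vol = lift(torch.randn(n, dim), coords)

    fused = InteractionModule(dim)(lidar_vol, cam_vol)

    # MAE-style pre-training would mask a high ratio of these tokens and train
    # a decoder to reconstruct the missing ones; only the mask is shown here.
    keep = torch.rand(fused.shape[0]) > 0.75  # keep ~25% of voxel tokens
    print(fused.shape, fused[keep].shape)
```

Because both modalities live in the same (X, Y, Z) grid before fusion, the interaction step needs no per-pair geometric alignment, which is the property the abstract credits with reducing information loss relative to purely BEV-plane fusion.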
Pages: 296 - 313 (18 pages)
Related Papers
50 items in total
  • [31] A Multi-modal Framework for 3D Facial Animation Control
    Xiao, Qiuyang
    Shi, Chengwei
    Cao, Chong
    PROCEEDINGS OF THE SIGGRAPH 2024 POSTERS, 2024,
  • [32] Routing Optimization of Multi-modal Interconnects In 3D ICs
    Lee, Young-Joon
    Lim, Sung Kyu
    2009 IEEE 59TH ELECTRONIC COMPONENTS AND TECHNOLOGY CONFERENCE, VOLS 1-4, 2009, : 32 - 39
  • [33] Incremental Dense Multi-modal 3D Scene Reconstruction
    Miksik, Ondrej
    Amar, Yousef
    Vineet, Vibhav
    Perez, Patrick
    Torr, Philip H. S.
    2015 IEEE/RSJ INTERNATIONAL CONFERENCE ON INTELLIGENT ROBOTS AND SYSTEMS (IROS), 2015, : 908 - 915
  • [34] Segmentation of Inflamed Synovia in Multi-modal 3D MRI
    Basso, Curzio
    Santoro, Matteo
    Verri, Alessandro
    Esposito, Mario
    2009 IEEE INTERNATIONAL SYMPOSIUM ON BIOMEDICAL IMAGING: FROM NANO TO MACRO, VOLS 1 AND 2, 2009, : 229+
  • [35] Multi-Modal 3D Object Detection by Box Matching
    Liu, Zhe
    Ye, Xiaoqing
    Zou, Zhikang
    He, Xinwei
    Tan, Xiao
    Ding, Errui
    Wang, Jingdong
    Bai, Xiang
    IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS, 2024,
  • [36] A new technique for multi-modal 3D image registration
    Stippel, G
    Ellsmere, J
    Warfield, SK
    Wells, WM
    Philips, W
    BIOMEDICAL IMAGE REGISTRATION, 2003, 2717 : 244 - 253
  • [37] MeshMAE: Masked Autoencoders for 3D Mesh Data Analysis
    Liang, Yaqian
    Zhao, Shanshan
    Yu, Baosheng
    Zhang, Jing
    He, Fazhi
    COMPUTER VISION - ECCV 2022, PT III, 2022, 13663 : 37 - 54
  • [38] A multi-modal 2D/3D registration scheme for preterm brain images
    Vandemeulebroucke, Jef
    Vansteenkiste, Ewout
    Philips, Wilfried
    2006 28TH ANNUAL INTERNATIONAL CONFERENCE OF THE IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY, VOLS 1-15, 2006, : 304 - 307
  • [39] Beyond First Impressions: Integrating Joint Multi-modal Cues for Comprehensive 3D Representation
    Wang, Haowei
    Tang, Jiji
    Ji, Jiayi
    Sun, Xiaoshuai
    Zhang, Rongsheng
    Ma, Yiwei
    Zhao, Minda
    Li, Lincheng
    Zhao, Zeng
    Lv, Tangjie
    Ji, Rongrong
    PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 3403 - 3414
  • [40] Unsupervised 3D Perception with 2D Vision-Language Distillation for Autonomous Driving
    Najibi, Mahyar
    Ji, Jingwei
    Zhou, Yin
    Qi, Charles R.
    Yan, Xinchen
    Ettinger, Scott
    Anguelov, Dragomir
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023), 2023, : 8568 - 8578