UniM²AE: Multi-modal Masked Autoencoders with Unified 3D Representation for 3D Perception in Autonomous Driving

Cited: 0
Authors
Zou, Jian [1 ]
Huang, Tianyu [1 ]
Yang, Guanglei [1 ]
Guo, Zhenhua [2 ]
Luo, Tao [3 ]
Feng, Chun-Mei [3 ]
Zuo, Wangmeng [1 ]
Affiliations
[1] Harbin Institute of Technology, Harbin, People's Republic of China
[2] Tianyijiaotong Technology Ltd, Suzhou, People's Republic of China
[3] A*STAR, Institute of High Performance Computing (IHPC), Singapore, Singapore
Keywords
Unified representation; sensor fusion; masked autoencoders
DOI
10.1007/978-3-031-72670-5_17
CLC number
TP18 [Artificial Intelligence Theory]
Discipline codes
081104; 0812; 0835; 1405
Abstract
Masked Autoencoders (MAE) play a pivotal role in learning potent representations, delivering outstanding results across various 3D perception tasks essential for autonomous driving. In real-world driving scenarios, multiple sensors are commonly deployed for comprehensive environment perception. Although integrating multi-modal features from these sensors can produce rich and powerful representations, MAE methods struggle with this integration due to the substantial disparity between the modalities. This work investigates multi-modal Masked Autoencoders tailored for a unified representation space in autonomous driving, aiming at a more efficient fusion of two distinct modalities. To marry the semantics inherent in images with the geometric structure of LiDAR point clouds, we propose UniM²AE, a potent yet straightforward multi-modal self-supervised pre-training framework built on two designs. First, it projects the features from both modalities into a cohesive 3D volume space that extends the bird's eye view (BEV) with the height dimension. This extension allows for a precise representation of objects and reduces information loss when aligning multi-modal features. Second, the Multi-modal 3D Interactive Module (MMIM) facilitates efficient inter-modal interaction. Extensive experiments on the nuScenes dataset attest to the efficacy of UniM²AE, showing improvements of 1.2% NDS in 3D object detection and 6.5% mIoU in BEV map segmentation. The code is available at https://github.com/hollow-503/UniM2AE.
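The abstract's central idea, lifting both camera and LiDAR features into one shared 3D volume before letting them interact, can be illustrated with a minimal NumPy sketch. All shapes here are hypothetical, and the per-voxel concatenate-and-project fusion is a toy stand-in; the paper's actual MMIM is a learned interaction module, not shown here.

```python
import numpy as np

# Hypothetical volume resolution (X, Y, Z) and channel width C.
X, Y, Z, C = 8, 8, 4, 16
rng = np.random.default_rng(0)

# Stand-ins for the two modality encoders: camera features lifted into
# the shared 3D volume, and LiDAR features voxelized into the same volume.
cam_vol = rng.standard_normal((X, Y, Z, C))
lidar_vol = rng.standard_normal((X, Y, Z, C))

def fuse_volumes(cam, lidar, w):
    """Toy interaction: concatenate the modalities per voxel and
    project back to C channels with a single linear map."""
    joint = np.concatenate([cam, lidar], axis=-1)  # (X, Y, Z, 2C)
    return joint @ w                               # (X, Y, Z, C)

w = rng.standard_normal((2 * C, C)) / np.sqrt(2 * C)
fused = fuse_volumes(cam_vol, lidar_vol, w)

# Collapsing the height axis recovers a BEV feature map; the full 3D
# volume retains the height information that plain BEV fusion discards.
bev = fused.max(axis=2)  # (X, Y, C)
print(fused.shape, bev.shape)
```

The point of the sketch is only the tensor layout: because both modalities live in the same (X, Y, Z) grid, alignment happens once, and BEV is a projection of the fused volume rather than the fusion space itself.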
Pages: 296-313
Page count: 18
Related papers
50 records
  • [1] Multi-modal Relation Distillation for Unified 3D Representation Learning
    Wang, Huiqun
    Bao, Yiping
    Pan, Panwang
    Li, Zeming
    Liu, Xiao
    Yang, Ruijie
    Huang, Di
    COMPUTER VISION - ECCV 2024, PT XXXIII, 2025, 15091 : 364 - 381
  • [2] Multi-Modal 3D Object Detection in Autonomous Driving: A Survey
    Wang, Yingjie
    Mao, Qiuyu
    Zhu, Hanqi
    Deng, Jiajun
    Zhang, Yu
    Ji, Jianmin
    Li, Houqiang
    Zhang, Yanyong
    INTERNATIONAL JOURNAL OF COMPUTER VISION, 2023, 131 (08) : 2122 - 2152
  • [3] Multi-Modal 3D Object Detection in Autonomous Driving: A Survey
    Yingjie Wang
    Qiuyu Mao
    Hanqi Zhu
    Jiajun Deng
    Yu Zhang
    Jianmin Ji
    Houqiang Li
    Yanyong Zhang
    International Journal of Computer Vision, 2023, 131 : 2122 - 2152
  • [4] MSeg3D: Multi-modal 3D Semantic Segmentation for Autonomous Driving
    Li, Jiale
    Dai, Hang
    Han, Hao
    Ding, Yong
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023, : 21694 - 21704
  • [5] OccGen: Generative Multi-modal 3D Occupancy Prediction for Autonomous Driving
    Wang, Guoqing
    Wang, Zhongdao
    Tang, Pin
    Zheng, Jilai
    Ren, Xiangxuan
    Feng, Bailan
    Ma, Chao
    COMPUTER VISION - ECCV 2024, PT XX, 2025, 15078 : 95 - 112
  • [6] Improving Deep Multi-modal 3D Object Detection for Autonomous Driving
    Khamsehashari, Razieh
    Schill, Kerstin
    2021 7TH INTERNATIONAL CONFERENCE ON AUTOMATION, ROBOTICS AND APPLICATIONS (ICARA 2021), 2021, : 263 - 267
  • [7] Multi-Modal 3D Object Detection in Autonomous Driving: A Survey and Taxonomy
    Wang, Li
    Zhang, Xinyu
    Song, Ziying
    Bi, Jiangfeng
    Zhang, Guoxin
    Wei, Haiyue
    Tang, Liyao
    Yang, Lei
    Li, Jun
    Jia, Caiyan
    Zhao, Lijun
    IEEE TRANSACTIONS ON INTELLIGENT VEHICLES, 2023, 8 (07): : 3781 - 3798
  • [8] Probabilistic 3D Multi-Modal, Multi-Object Tracking for Autonomous Driving
    Chiu, Hsu-kuang
    Li, Jie
    Ambrus, Rares
    Bohg, Jeannette
    2021 IEEE INTERNATIONAL CONFERENCE ON ROBOTICS AND AUTOMATION (ICRA 2021), 2021, : 14227 - 14233
  • [9] A scene representation based on multi-modal 2D and 3D features
    Baseski, Emre
    Pugeault, Nicolas
    Kalkan, Sinan
    Kraft, Dirk
    Woergoetter, Florentin
    Krueger, Norbert
    2007 IEEE 11TH INTERNATIONAL CONFERENCE ON COMPUTER VISION, VOLS 1-6, 2007, : 63 - +
  • [10] Multi-modal 3D Human Pose Estimation with 2D Weak Supervision in Autonomous Driving
    Zheng, Jingxiao
    Shi, Xinwei
    Gorban, Alexander
    Mao, Junhua
    Song, Yang
    Qi, Charles R.
    Liu, Ting
    Chari, Visesh
    Cornman, Andre
    Zhou, Yin
    Li, Congcong
    Anguelov, Dragomir
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION WORKSHOPS, CVPRW 2022, 2022, : 4477 - 4486