TBFNT3D: Two-Branch Fusion Network With Transformer for Multimodal Indoor 3D Object Detection

Authors
Cheng, Jun [1, 2, 3]
Zhang, Sheng [4]
Affiliations
[1] Chinese Acad Sci, Shenzhen Inst Adv Technol, Guangdong Hong Kong Macao Joint Lab Human Machine, Shenzhen 100045, Peoples R China
[2] Univ Chinese Acad Sci, Shenzhen Coll Adv Technol, Shenzhen 100045, Peoples R China
[3] Chinese Univ Hong Kong, Hong Kong, Peoples R China
[4] Chinese Acad Sci, Shenzhen Inst Adv Technol, CAS Key Lab Human Machine Intelligence Synergy Sys, Shenzhen 100045, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
3D object detection; indoor scenes; multimodal fusion; transformer;
DOI
10.1109/LRA.2023.3309133
Chinese Library Classification number
TP24 [Robotics]
Discipline classification codes
080202; 1405
Abstract
Indoor 3D object detection based on point clouds is widely used in robotics, augmented reality, and virtual reality. Point clouds generated by RGB-D cameras are sparse for distant objects, which degrades detection performance. Multimodal 3D object detection can improve performance by fusing features from point clouds and images. RGB images can be converted to dense 3D features that complement detection based on point clouds alone. We refer to the 3D data transformed from RGB images as estimated 3D data. We propose a two-branch fusion network with a transformer for multimodal indoor 3D object detection, named TBFNT3D. In TBFNT3D, voxels converted from the point clouds and from the images are added together to obtain a consistent voxel representation. This enriches the features in voxel space, and features from different modalities do not require a complex alignment process. To make better use of the estimated 3D data, its noise must be handled and its redundant parts removed. The receptive field of the 3D sparse convolution is expanded into the 2D image space, which weakens the effect of noise. A bin-based sampling strategy is applied to near and distant objects to remove redundant estimated 3D data. In addition, to fuse the multimodal features efficiently, we apply a deformable transformer to obtain the detection results. Finally, TBFNT3D is evaluated on the SUN RGB-D and ScanNet datasets and achieves state-of-the-art results.
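The abstract does not give implementation details, so the following is only a minimal sketch of two ideas it mentions: adding voxel grids from two branches, and bin-based sampling to thin out redundant estimated 3D data. Everything in it is an assumption made for illustration and is not taken from the paper: the function names (`voxelize`, `add_voxel_grids`, `bin_based_sample`), the use of the z coordinate as depth, scalar per-point features, the bin edges, and the per-bin cap.

```python
import random
from collections import defaultdict


def voxelize(points, voxel_size):
    """Sum scalar features of (x, y, z, feature) points per voxel index."""
    grid = defaultdict(float)
    for x, y, z, feat in points:
        key = (int(x // voxel_size), int(y // voxel_size), int(z // voxel_size))
        grid[key] += feat
    return dict(grid)


def add_voxel_grids(grid_a, grid_b):
    """Fuse two sparse voxel grids by element-wise addition.

    Both grids share the same voxel indexing, so no extra alignment
    step is needed; a voxel present in only one grid is kept as-is.
    """
    fused = defaultdict(float, grid_a)
    for key, feat in grid_b.items():
        fused[key] += feat
    return dict(fused)


def bin_based_sample(points, bin_edges, max_per_bin, seed=0):
    """Cap the number of points kept in each depth bin.

    Assumption: depth is the z coordinate. Dense near-range bins are
    randomly downsampled to max_per_bin; sparse far-range bins that
    are already under the cap are left untouched.
    """
    rng = random.Random(seed)
    bins = defaultdict(list)
    for p in points:
        depth = p[2]
        for i in range(len(bin_edges) - 1):
            if bin_edges[i] <= depth < bin_edges[i + 1]:
                bins[i].append(p)
                break
    sampled = []
    for i in sorted(bins):
        pts = bins[i]
        if len(pts) > max_per_bin:
            pts = rng.sample(pts, max_per_bin)
        sampled.extend(pts)
    return sampled


# Hypothetical toy data: a sparse sensor point cloud and dense estimated points.
sensor_points = [(0.1, 0.1, 0.1, 1.0), (0.2, 0.2, 0.2, 2.0), (1.1, 0.1, 0.1, 3.0)]
estimated_points = [(0.3, 0.3, 0.3, 0.5)]
fused_grid = add_voxel_grids(voxelize(sensor_points, 0.5),
                             voxelize(estimated_points, 0.5))

near = [(0.0, 0.0, 1.0, 1.0)] * 10   # many redundant near points
far = [(0.0, 0.0, 4.0, 1.0)] * 2     # few distant points
kept = bin_based_sample(near + far, bin_edges=[0.0, 2.0, 6.0], max_per_bin=3)
```

In this toy setup the near bin is reduced to the cap while both distant points survive, which is the intended effect of treating near and distant regions separately. A real pipeline would use learned feature vectors and sparse tensors rather than dictionaries of scalars.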
Pages: 6523-6530
Page count: 8