TBFNT3D: Two-Branch Fusion Network With Transformer for Multimodal Indoor 3D Object Detection

Times cited: 0
Authors
Cheng, Jun [1 ,2 ,3 ]
Zhang, Sheng [4 ]
Affiliations
[1] Chinese Acad Sci, Shenzhen Inst Adv Technol, Guangdong Hong Kong Macao Joint Lab Human Machine, Shenzhen 100045, Peoples R China
[2] Univ Chinese Acad Sci, Shenzhen Coll Adv Technol, Shenzhen 100045, Peoples R China
[3] Chinese Univ Hong Kong, Hong Kong, Peoples R China
[4] Chinese Acad Sci, Shenzhen Inst Adv Technol, CAS Key Lab Human Machine Intelligence Synergy Sys, Shenzhen 100045, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
3D object detection; indoor scenes; multimodal fusion; transformer;
DOI
10.1109/LRA.2023.3309133
Chinese Library Classification (CLC) number
TP24 [Robotics];
Discipline classification code
080202 ; 1405 ;
Abstract
Indoor 3D object detection based on point clouds is widely applied in robotics, augmented reality, and virtual reality. Point clouds generated by RGB-D cameras are sparse for distant objects, which degrades detection performance. Multimodal 3D object detection can improve performance by fusing features from point clouds and images. RGB images can be converted to dense 3D features that complement detection based on point clouds alone. We refer to the 3D data transformed from RGB images as estimated 3D data. We propose TBFNT3D, a two-branch fusion network with a transformer for multimodal indoor 3D object detection. In TBFNT3D, voxels converted from the point clouds and from the images are added together to obtain a consistent voxel representation. This enriches the features in voxel space, and features from different modalities do not require a complex alignment process. Making better use of estimated 3D data requires handling its noise and removing redundant data. The receptive field of the 3D sparse convolution is expanded into the 2D image space, which weakens the effect of noise. A bin-based sampling strategy is applied to near and distant objects to remove the redundant estimated 3D data. In addition, to fuse the multimodal features efficiently, we apply a deformable transformer to obtain the detection results. Finally, TBFNT3D is evaluated on the SUN RGB-D and ScanNet datasets, where it achieves state-of-the-art results.
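The abstract names two steps without giving their details: bin-based sampling of the estimated 3D data, and fusion by adding voxels from the two branches. The sketch below is only an illustration of those two ideas under stated assumptions, not the authors' implementation. The function names (`bin_based_sample`, `voxelize`, `fuse_by_addition`), the use of distance from the sensor origin as depth, the per-bin cap, and the use of point counts in place of learned voxel features are all hypothetical choices made for this example.

```python
import random
from collections import defaultdict


def bin_based_sample(points, bin_edges, per_bin, seed=0):
    """Keep at most `per_bin` points in each depth bin.

    Assumption: depth is the Euclidean distance of (x, y, z) from the
    sensor origin, and every bin shares the same cap. Points outside
    all bins are dropped.
    """
    rng = random.Random(seed)
    bins = defaultdict(list)
    for p in points:
        depth = (p[0] ** 2 + p[1] ** 2 + p[2] ** 2) ** 0.5
        for i in range(len(bin_edges) - 1):
            if bin_edges[i] <= depth < bin_edges[i + 1]:
                bins[i].append(p)
                break
    kept = []
    for i in sorted(bins):
        members = bins[i]
        if len(members) > per_bin:
            # Dense (typically near) bins are thinned; sparse bins are kept whole.
            members = rng.sample(members, per_bin)
        kept.extend(members)
    return kept


def voxelize(points, voxel_size):
    """Map points to integer voxel indices and count the points per voxel.

    Assumption: the count stands in for a learned per-voxel feature.
    """
    grid = defaultdict(float)
    for x, y, z in points:
        key = (int(x // voxel_size), int(y // voxel_size), int(z // voxel_size))
        grid[key] += 1.0
    return grid


def fuse_by_addition(grid_a, grid_b):
    """Add two voxel grids that share one index space, voxel by voxel."""
    fused = defaultdict(float)
    for grid in (grid_a, grid_b):
        for key, value in grid.items():
            fused[key] += value
    return dict(fused)


if __name__ == "__main__":
    sensor_points = [(0.5, 0.0, 0.0), (5.0, 0.0, 0.0)]
    near = [(0.5 + 0.001 * i, 0.0, 0.0) for i in range(100)]
    far = [(5.0, 0.0, 0.0), (5.5, 0.0, 0.0)]
    estimated = bin_based_sample(near + far, [0.0, 2.0, 8.0], per_bin=10)
    fused = fuse_by_addition(voxelize(sensor_points, 0.5), voxelize(estimated, 0.5))
    print(len(estimated), len(fused))
```

Because both branches are indexed on the same voxel grid, the fusion step is a plain elementwise sum and needs no cross-modal alignment, which is the property the abstract highlights. The per-bin cap thins the dense near-range estimated points while leaving the sparse far-range ones intact.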
Pages: 6523-6530
Number of pages: 8