TBFNT3D: Two-Branch Fusion Network With Transformer for Multimodal Indoor 3D Object Detection

Cited by: 0

Authors:

Cheng, Jun [1,2,3]

Zhang, Sheng [4]

Affiliations:

[1] Chinese Acad Sci, Shenzhen Inst Adv Technol, Guangdong Hong Kong Macao Joint Lab Human Machine, Shenzhen 100045, Peoples R China

[2] Univ Chinese Acad Sci, Shenzhen Coll Adv Technol, Shenzhen 100045, Peoples R China

[3] Chinese Univ Hong Kong, Hong Kong, Peoples R China

[4] Chinese Acad Sci, Shenzhen Inst Adv Technol, CAS Key Lab Human Machine Intelligence Synergy Sys, Shenzhen 100045, Peoples R China

Source:

IEEE ROBOTICS AND AUTOMATION LETTERS | 2023, Vol. 8, No. 10

Funding:

National Natural Science Foundation of China;

Keywords:

3D object detection; indoor scenes; multimodal fusion; transformer;

DOI:

10.1109/LRA.2023.3309133

Chinese Library Classification (CLC) number:

TP24 [Robotics];

Subject classification code:

080202 ; 1405 ;

Abstract:

Indoor 3D object detection based on point clouds is widely applied in robotics, augmented reality, and virtual reality. The point clouds generated from RGB-D cameras are sparse for distant objects, which degrades detection performance. Multimodal 3D object detection can improve performance by fusing features from point clouds and images. RGB images can be converted to dense 3D features, which complement 3D object detection that uses only point clouds. We refer to the 3D data transformed from RGB images as estimated 3D data. We therefore propose a two-branch fusion network with a transformer for multimodal indoor 3D object detection, named TBFNT3D. In TBFNT3D, voxels converted from the point clouds and from the images are added together to obtain a consistent voxel representation. This enriches the features in the voxel space, and features from different modalities do not require a complex alignment process. To make better use of the estimated 3D data, noise must be suppressed and redundant estimated 3D data removed. The receptive field of the 3D sparse convolution is expanded into the 2D image space, which weakens the effect of noise. A bin-based sampling strategy is applied to near and distant objects to remove redundant estimated 3D data. In addition, to fuse the multimodal features efficiently, a deformable transformer is applied to obtain the detection results. Finally, TBFNT3D is evaluated on the SUN RGB-D and ScanNet datasets, where it achieves state-of-the-art results.
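The abstract gives no implementation details, so the following is only a minimal NumPy sketch of two ideas it mentions: adding voxels from the two branches into one shared grid, and bin-based sampling of the estimated 3D data. The function names (`voxelize`, `bin_sample`), the dense grid, the use of distance from the sensor origin as the binning quantity, and the random per-bin cap are all assumptions made for illustration; they are not taken from the paper, which uses sparse convolution and may define its bins differently.

```python
import numpy as np


def voxelize(points, feats, grid_min, voxel_size, grid_shape):
    """Scatter per-point features into a dense voxel grid by summation.

    Hypothetical helper: a dense grid stands in for a sparse voxel tensor.
    """
    idx = np.floor((points - np.asarray(grid_min)) / voxel_size).astype(np.int64)
    # Drop points that fall outside the grid.
    inside = np.all((idx >= 0) & (idx < np.asarray(grid_shape)), axis=1)
    idx, feats = idx[inside], feats[inside]
    grid = np.zeros((*grid_shape, feats.shape[1]), dtype=feats.dtype)
    # np.add.at accumulates correctly when several points share a voxel.
    np.add.at(grid, (idx[:, 0], idx[:, 1], idx[:, 2]), feats)
    return grid


def bin_sample(points, bin_edges, max_per_bin, rng):
    """Return indices keeping at most max_per_bin points per distance bin.

    Assumption: bins are defined on distance from the sensor origin.
    """
    dist = np.linalg.norm(points, axis=1)
    bin_id = np.digitize(dist, bin_edges)
    keep = []
    for b in np.unique(bin_id):
        members = np.flatnonzero(bin_id == b)
        if members.size > max_per_bin:
            members = rng.choice(members, size=max_per_bin, replace=False)
        keep.append(members)
    if not keep:
        return np.empty(0, dtype=np.int64)
    return np.sort(np.concatenate(keep))


def fuse(pc_points, pc_feats, est_points, est_feats,
         grid_min, voxel_size, grid_shape):
    """Add the two voxel grids so both modalities share one representation."""
    pc_grid = voxelize(pc_points, pc_feats, grid_min, voxel_size, grid_shape)
    est_grid = voxelize(est_points, est_feats, grid_min, voxel_size, grid_shape)
    return pc_grid + est_grid
```

Because both branches are scattered into the same grid coordinates, the fusion reduces to an element-wise sum, which is one way to read the abstract's remark that no complex alignment process is needed. In this sketch, `bin_sample` would be applied to the estimated 3D points before calling `fuse`, so that dense nearby regions do not dominate.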

Pages: 6523-6530

Number of pages: 8
