TBFNT3D: Two-Branch Fusion Network With Transformer for Multimodal Indoor 3D Object Detection

Cited by: 0

Authors:

Cheng, Jun [1,2,3]

Zhang, Sheng [4]

Affiliations:

[1] Chinese Acad Sci, Shenzhen Inst Adv Technol, Guangdong Hong Kong Macao Joint Lab Human Machine, Shenzhen 100045, Peoples R China

[2] Univ Chinese Acad Sci, Shenzhen Coll Adv Technol, Shenzhen 100045, Peoples R China

[3] Chinese Univ Hong Kong, Hong Kong, Peoples R China

[4] Chinese Acad Sci, Shenzhen Inst Adv Technol, CAS Key Lab Human Machine Intelligence Synergy Sys, Shenzhen 100045, Peoples R China

Source:

IEEE ROBOTICS AND AUTOMATION LETTERS | 2023 / Vol. 8 / No. 10

Funding:

National Natural Science Foundation of China;

Keywords:

3D object detection; indoor scenes; multimodal fusion; transformer;

DOI:

10.1109/LRA.2023.3309133

Chinese Library Classification (CLC) number:

TP24 [Robotics];

Discipline classification code:

080202 ; 1405 ;

Abstract:

Indoor 3D object detection based on point clouds has been widely applied in robotics, augmented reality, and virtual reality. The point clouds generated by RGB-D cameras are sparse for distant objects, which degrades detection performance. Multimodal 3D object detection can improve detection performance by fusing features from point clouds and images. RGB images can be converted to dense 3D features, which complement 3D object detection that uses only point clouds. We refer to the 3D data transformed from RGB images as estimated 3D data. We therefore propose TBFNT3D, a two-branch fusion network with a transformer for multimodal indoor 3D object detection. In TBFNT3D, voxels converted from the point clouds and from the images are added together to obtain a consistent voxel representation. This enriches the features in the voxel space, and features from different modalities do not require a complex alignment process. Making better use of the estimated 3D data requires handling its noise and removing its redundancy. The receptive field of the 3D sparse convolution is expanded into the 2D image space, which weakens the effect of noise. A bin-based sampling strategy is applied to near and distant objects to remove redundant estimated 3D data. In addition, to fuse the multimodal features efficiently, we apply a deformable transformer to obtain the detection results. Finally, TBFNT3D is evaluated on the SUN RGB-D and ScanNet datasets, where it achieves state-of-the-art results.
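The abstract gives no implementation details, so the following is only a minimal sketch of two ideas it names: capping estimated 3D points per depth bin, and adding two voxel grids that share one index space. Everything concrete here is an assumption made for illustration and is not taken from the paper: the function names (`voxelize`, `bin_sample`, `fuse`), the use of the z coordinate as depth, point counts as voxel "features", the bin edge, and the per-bin cap. The learned features, the 3D sparse convolution, and the deformable transformer are not modeled.

```python
import random
from collections import Counter, defaultdict


def voxelize(points, voxel_size):
    """Count points per voxel; keys are integer (i, j, k) voxel indices."""
    grid = Counter()
    for x, y, z in points:
        key = (int(x // voxel_size), int(y // voxel_size), int(z // voxel_size))
        grid[key] += 1
    return grid


def bin_sample(points, bin_edges, max_per_bin, seed=0):
    """Keep at most max_per_bin points in each depth bin (depth = z, an assumption)."""
    rng = random.Random(seed)
    bins = defaultdict(list)
    for p in points:
        # Bin index = number of edges at or below this point's depth.
        idx = sum(p[2] >= edge for edge in bin_edges)
        bins[idx].append(p)
    kept = []
    for idx in sorted(bins):
        members = bins[idx]
        if len(members) > max_per_bin:
            members = rng.sample(members, max_per_bin)
        kept.extend(members)
    return kept


def fuse(sensor_grid, estimated_grid):
    """Add two sparse voxel grids voxel by voxel; no cross-modal alignment step."""
    fused = Counter(sensor_grid)
    fused.update(estimated_grid)  # Counter.update adds counts
    return fused


# Hypothetical toy data: a few sensor points and denser image-derived points.
sensor_points = [(0.1, 0.1, 0.5), (0.2, 0.1, 0.6), (1.1, 0.2, 3.2)]
estimated_points = [(0.1, 0.2, 0.5)] * 5 + [(1.2, 0.3, 3.1), (1.3, 0.1, 3.4)]

# One bin edge at 2.0 splits "near" from "distant"; each bin keeps at most 2 points.
sampled = bin_sample(estimated_points, bin_edges=[2.0], max_per_bin=2)
fused = fuse(voxelize(sensor_points, 0.5), voxelize(sampled, 0.5))
```

In this toy setting the near bin, which is heavily oversampled by the image branch, is thinned, while the sparse distant bin is left intact. Because both branches are indexed by the same voxel coordinates, fusion reduces to addition.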


Pages: 6523-6530

Page count: 8

