Enhancing Fine-Grained 3D Object Recognition using Hybrid Multi-Modal Vision Transformer-CNN Models

Cited by: 3

Authors:

Xiong, Songsong [1]

Tziafas, Georgios [1]

Kasaei, Hamidreza [1]

Affiliations:

[1] Univ Groningen, Dept Artificial Intelligence, Groningen, Netherlands

Source:

2023 IEEE/RSJ INTERNATIONAL CONFERENCE ON INTELLIGENT ROBOTS AND SYSTEMS (IROS) | 2023

DOI:

10.1109/IROS55552.2023.10342235

Chinese Library Classification (CLC) number:

TP18 [Artificial Intelligence Theory]

Subject classification codes:

081104; 0812; 0835; 1405

Abstract:

Robots operating in human-centered environments, such as retail stores, restaurants, and households, are often required to distinguish between similar objects in different contexts with a high degree of accuracy. However, fine-grained object recognition remains a challenge in robotics due to the high intra-category and low inter-category dissimilarities. In addition, the limited number of fine-grained 3D datasets poses a significant problem in addressing this issue effectively. In this paper, we propose a hybrid multi-modal Vision Transformer (ViT) and Convolutional Neural Networks (CNN) approach to improve the performance of fine-grained visual classification (FGVC). To address the shortage of FGVC 3D datasets, we generated two synthetic datasets. The first dataset consists of 20 categories related to restaurants with a total of 100 instances, while the second dataset contains 120 shoe instances. Our approach was evaluated on both datasets, and the results indicate that our hybrid multi-modal model outperforms both CNN-only and ViT-only baselines, achieving a recognition accuracy of 94.50% and 93.51% on the restaurant and shoe datasets, respectively. Additionally, we have made our FGVC RGB-D datasets available to the research community to enable further experimentation and advancement. Furthermore, we integrated our proposed method with a robot framework and demonstrated its potential as a fine-grained perception tool in both simulated and real-world robotic scenarios.

Pages: 5751-5757

Page count: 7
