Enhancing Fine-Grained 3D Object Recognition using Hybrid Multi-Modal Vision Transformer-CNN Models

Cited by: 3
Authors
Xiong, Songsong [1]
Tziafas, Georgios [1]
Kasaei, Hamidreza [1]
Affiliations
[1] Univ Groningen, Dept Artificial Intelligence, Groningen, Netherlands
DOI
10.1109/IROS55552.2023.10342235
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Robots operating in human-centered environments, such as retail stores, restaurants, and households, are often required to distinguish between similar objects in different contexts with a high degree of accuracy. However, fine-grained object recognition remains a challenge in robotics due to the high intra-category and low inter-category dissimilarities. In addition, the limited number of fine-grained 3D datasets poses a significant problem in addressing this issue effectively. In this paper, we propose a hybrid multi-modal Vision Transformer (ViT) and Convolutional Neural Networks (CNN) approach to improve the performance of fine-grained visual classification (FGVC). To address the shortage of FGVC 3D datasets, we generated two synthetic datasets. The first dataset consists of 20 categories related to restaurants with a total of 100 instances, while the second dataset contains 120 shoe instances. Our approach was evaluated on both datasets, and the results indicate that our hybrid multi-modal model outperforms both CNN-only and ViT-only baselines, achieving a recognition accuracy of 94.50% and 93.51% on the restaurant and shoe datasets, respectively. Additionally, we have made our FGVC RGB-D datasets available to the research community to enable further experimentation and advancement. Furthermore, we integrated our proposed method with a robot framework and demonstrated its potential as a fine-grained perception tool in both simulated and real-world robotic scenarios.
Pages: 5751-5757
Page count: 7
Related papers
50 in total
  • [41] MMF3: Neural Code Summarization Based on Multi-Modal Fine-Grained Feature Fusion
    Ma, Zheng
    Gao, Yuexiu
    Lyu, Lei
    Lyu, Chen
    PROCEEDINGS OF THE 16TH ACM/IEEE INTERNATIONAL SYMPOSIUM ON EMPIRICAL SOFTWARE ENGINEERING AND MEASUREMENT, ESEM 2022, 2022, : 171 - 182
  • [42] Unlocking the power of multi-modal fusion in 3D object tracking
    Hu, Yue
    IET COMPUTER VISION, 2025, 19 (01)
  • [43] Multi-Modal 3D Object Detection in Autonomous Driving: A Survey
    Wang, Yingjie
    Mao, Qiuyu
    Zhu, Hanqi
    Deng, Jiajun
    Zhang, Yu
    Ji, Jianmin
    Li, Houqiang
    Zhang, Yanyong
    INTERNATIONAL JOURNAL OF COMPUTER VISION, 2023, 131 (08) : 2122 - 2152
  • [45] Fine-grained Recognition of 3D Shapes Based on Multi-view Recurrent Neural Network
    Dong, Shuai
    Zou, Kun
    Li, Wensheng
    ICMLC 2020: 2020 12TH INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND COMPUTING, 2018, : 152 - 156
  • [46] ObjectFusion: Multi-modal 3D Object Detection with Object-Centric Fusion
    Cai, Qi
    Pan, Yingwei
    Yao, Ting
    Ngo, Chong-Wah
    Mei, Tao
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023), 2023, : 18021 - 18030
  • [47] Multi-modal 2D and 3D biometrics for face recognition
    Chang, KI
    Bowyer, KW
    Flynn, PJ
    IEEE INTERNATIONAL WORKSHOP ON ANALYSIS AND MODELING OF FACE AND GESTURES, 2003, : 187 - 194
  • [48] Cross-Level Multi-Modal Features Learning With Transformer for RGB-D Object Recognition
    Zhang, Ying
    Yin, Maoliang
    Wang, Heyong
    Hua, Changchun
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2023, 33 (12) : 7121 - 7130
  • [50] Multi-modal 3D imaging of radionuclides using multiple hybrid Compton cameras
    Omata, Akihisa
    Masubuchi, Miho
    Koshikawa, Nanase
    Kataoka, Jun
    Kato, Hiroki
    Toyoshima, Atsushi
    Teramoto, Takahiro
    Ooe, Kazuhiro
    Liu, Yuwei
    Matsunaga, Keiko
    Kamiya, Takashi
    Watabe, Tadashi
    Shimosegawa, Eku
    Hatazawa, Jun
    SCIENTIFIC REPORTS, 2022, 12 (01)