Enhancing Fine-Grained 3D Object Recognition using Hybrid Multi-Modal Vision Transformer-CNN Models

Cited by: 3
Authors
Xiong, Songsong [1]
Tziafas, Georgios [1]
Kasaei, Hamidreza [1]
Affiliations
[1] Univ Groningen, Dept Artificial Intelligence, Groningen, Netherlands
DOI
10.1109/IROS55552.2023.10342235
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Robots operating in human-centered environments, such as retail stores, restaurants, and households, are often required to distinguish between similar objects in different contexts with a high degree of accuracy. However, fine-grained object recognition remains a challenge in robotics due to the high intra-category and low inter-category dissimilarities. In addition, the limited number of fine-grained 3D datasets poses a significant problem in addressing this issue effectively. In this paper, we propose a hybrid multi-modal Vision Transformer (ViT) and Convolutional Neural Networks (CNN) approach to improve the performance of fine-grained visual classification (FGVC). To address the shortage of FGVC 3D datasets, we generated two synthetic datasets. The first dataset consists of 20 categories related to restaurants with a total of 100 instances, while the second dataset contains 120 shoe instances. Our approach was evaluated on both datasets, and the results indicate that our hybrid multi-modal model outperforms both CNN-only and ViT-only baselines, achieving a recognition accuracy of 94.50% and 93.51% on the restaurant and shoe datasets, respectively. Additionally, we have made our FGVC RGB-D datasets available to the research community to enable further experimentation and advancement. Furthermore, we integrated our proposed method with a robot framework and demonstrated its potential as a fine-grained perception tool in both simulated and real-world robotic scenarios.
Pages: 5751-5757
Number of pages: 7