Enhancing Fine-Grained 3D Object Recognition using Hybrid Multi-Modal Vision Transformer-CNN Models

Times cited: 3
Authors
Xiong, Songsong [1]
Tziafas, Georgios [1]
Kasaei, Hamidreza [1]
Affiliations
[1] Univ Groningen, Dept Artificial Intelligence, Groningen, Netherlands
DOI
10.1109/IROS55552.2023.10342235
Chinese Library Classification
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Robots operating in human-centered environments, such as retail stores, restaurants, and households, are often required to distinguish between similar objects in different contexts with a high degree of accuracy. However, fine-grained object recognition remains a challenge in robotics due to the high intra-category and low inter-category dissimilarities. In addition, the limited number of fine-grained 3D datasets poses a significant problem in addressing this issue effectively. In this paper, we propose a hybrid multi-modal Vision Transformer (ViT) and Convolutional Neural Networks (CNN) approach to improve the performance of fine-grained visual classification (FGVC). To address the shortage of FGVC 3D datasets, we generated two synthetic datasets. The first dataset consists of 20 categories related to restaurants with a total of 100 instances, while the second dataset contains 120 shoe instances. Our approach was evaluated on both datasets, and the results indicate that our hybrid multi-modal model outperforms both CNN-only and ViT-only baselines, achieving a recognition accuracy of 94.50% and 93.51% on the restaurant and shoe datasets, respectively. Additionally, we have made our FGVC RGB-D datasets available to the research community to enable further experimentation and advancement. Furthermore, we integrated our proposed method with a robot framework and demonstrated its potential as a fine-grained perception tool in both simulated and real-world robotic scenarios.
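The abstract does not say how the CNN and ViT branches, or the RGB and depth modalities, are combined. As a purely illustrative sketch under a stated assumption (a score-level "late fusion" rule that is NOT taken from the paper), each hypothetical branch (named here `cnn_rgb`, `cnn_depth`, `vit_rgb`, `vit_depth`) emits class logits, which are converted to probabilities and averaged with optional per-branch weights before picking the most likely class:

```python
import math
from typing import Dict, List, Optional, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a single logit vector."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def late_fusion(branch_logits: Dict[str, Sequence[float]],
                weights: Optional[Dict[str, float]] = None) -> List[float]:
    """Weighted average of per-branch class probabilities.

    Hypothetical fusion rule for illustration only; branch names and
    weights are assumptions, not the authors' architecture.
    """
    names = list(branch_logits)
    if weights is None:
        weights = {n: 1.0 for n in names}  # equal weighting by default
    total_w = sum(weights[n] for n in names)
    if total_w <= 0:
        raise ValueError("branch weights must sum to a positive value")
    num_classes = len(branch_logits[names[0]])
    fused = [0.0] * num_classes
    for n in names:
        probs = softmax(branch_logits[n])
        if len(probs) != num_classes:
            raise ValueError(f"branch {n!r} has a different number of classes")
        for i, p in enumerate(probs):
            fused[i] += weights[n] * p / total_w
    return fused


def predict(branch_logits: Dict[str, Sequence[float]],
            weights: Optional[Dict[str, float]] = None) -> int:
    """Index of the class with the highest fused probability."""
    fused = late_fusion(branch_logits, weights)
    return max(range(len(fused)), key=fused.__getitem__)


if __name__ == "__main__":
    # Made-up logits for three fine-grained classes from four hypothetical branches.
    example = {
        "cnn_rgb": [2.0, 0.5, 0.1],
        "cnn_depth": [0.2, 1.5, 0.3],
        "vit_rgb": [1.8, 0.4, 0.2],
        "vit_depth": [0.1, 1.2, 0.9],
    }
    print(late_fusion(example), predict(example))
```

Score-level averaging is only one possible design; concatenating CNN and ViT feature vectors ahead of a shared classifier is an equally plausible alternative, so the paper itself should be consulted for the architecture actually used.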
Pages: 5751–5757
Number of pages: 7