Enhancing Fine-Grained 3D Object Recognition using Hybrid Multi-Modal Vision Transformer-CNN Models

Cited by: 3
Authors
Xiong, Songsong [1]
Tziafas, Georgios [1]
Kasaei, Hamidreza [1]
Affiliations
[1] Univ Groningen, Dept Artificial Intelligence, Groningen, Netherlands
DOI
10.1109/IROS55552.2023.10342235
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Robots operating in human-centered environments, such as retail stores, restaurants, and households, are often required to distinguish between similar objects in different contexts with a high degree of accuracy. However, fine-grained object recognition remains a challenge in robotics due to the high intra-category and low inter-category dissimilarities. In addition, the limited number of fine-grained 3D datasets poses a significant problem in addressing this issue effectively. In this paper, we propose a hybrid multi-modal Vision Transformer (ViT) and Convolutional Neural Networks (CNN) approach to improve the performance of fine-grained visual classification (FGVC). To address the shortage of FGVC 3D datasets, we generated two synthetic datasets. The first dataset consists of 20 categories related to restaurants with a total of 100 instances, while the second dataset contains 120 shoe instances. Our approach was evaluated on both datasets, and the results indicate that our hybrid multi-modal model outperforms both CNN-only and ViT-only baselines, achieving a recognition accuracy of 94.50% and 93.51% on the restaurant and shoe datasets, respectively. Additionally, we have made our FGVC RGB-D datasets available to the research community to enable further experimentation and advancement. Furthermore, we integrated our proposed method with a robot framework and demonstrated its potential as a fine-grained perception tool in both simulated and real-world robotic scenarios.
Pages: 5751-5757
Number of pages: 7