Patch-Level Consistency Regularization in Self-Supervised Transfer Learning for Fine-Grained Image Recognition

Authors
Lee, Yejin [1]
Lee, Suho [1]
Hwang, Sangheum [1, 2, 3]
Affiliations
[1] Seoul Natl Univ Sci & Technol, Dept Data Sci, Seoul 01811, South Korea
[2] Seoul Natl Univ Sci & Technol, Dept Ind & Informat Syst Engn, Seoul 01811, South Korea
[3] Seoul Natl Univ Sci & Technol, Res Ctr Elect & Informat Technol, Seoul 01811, South Korea
Source
APPLIED SCIENCES-BASEL | 2023 / Volume 13 / Issue 18
Funding
National Research Foundation of Singapore
Keywords
self-supervised learning; fine-grained image recognition; transfer learning; Vision Transformer
DOI
10.3390/app131810493
Chinese Library Classification (CLC) number
O6 [Chemistry]
Discipline classification code
0703
Abstract
Fine-grained image recognition aims to classify fine subcategories belonging to the same parent category, such as vehicle model or bird species classification. This is an inherently challenging task because a classifier must capture subtle interclass differences under large intraclass variances. Most previous approaches are based on supervised learning, which requires a large-scale labeled dataset. However, such large-scale annotated datasets for fine-grained image recognition are difficult to collect because they generally require domain expertise during the labeling process. In this study, we propose a self-supervised transfer learning method based on Vision Transformer (ViT) to learn finer representations without human annotations. Interestingly, it is observed that existing self-supervised learning methods using ViT (e.g., DINO) show poor patch-level semantic consistency, which may be detrimental to learning finer representations. Motivated by this observation, we propose a consistency loss function that encourages patch embeddings of the overlapping area between two augmented views to be similar to each other during self-supervised learning on fine-grained datasets. In addition, we explore effective transfer learning strategies to fully leverage existing self-supervised models trained on large-scale labeled datasets. Contrary to the previous literature, our findings indicate that training only the last block of ViT is effective for self-supervised transfer learning. We demonstrate the effectiveness of our proposed approach through extensive experiments using six fine-grained image classification benchmark datasets, including FGVC Aircraft, CUB-200-2011, Food-101, Oxford 102 Flowers, Stanford Cars, and Stanford Dogs. Under the linear evaluation protocol, our method achieves an average accuracy of 78.5%, outperforming the existing transfer learning method, which yields 77.2%.
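The patch-level consistency idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' formulation: the abstract does not specify the loss, so the cosine-distance form, the function names (`overlap_pairs`, `patch_consistency_loss`), the 16-pixel patch size, and the assumption that both crops are aligned to the patch grid without rescaling or flipping are all hypothetical simplifications chosen for illustration.

```python
import math

def overlap_pairs(box_a, box_b, patch=16):
    """Pair up patches of two crops that cover the same image region.

    Boxes are (top, left, height, width) in pixels. Assumption: both crops
    are aligned to the patch grid and are not rescaled or flipped.
    Returns a list of (patch_index_in_a, patch_index_in_b), row-major.
    """
    ta, la, ha, wa = box_a
    tb, lb, hb, wb = box_b
    top, left = max(ta, tb), max(la, lb)
    bottom = min(ta + ha, tb + hb)
    right = min(la + wa, lb + wb)
    pairs = []
    # Empty ranges (no overlap) simply yield no pairs.
    for y in range(top, bottom, patch):
        for x in range(left, right, patch):
            ia = ((y - ta) // patch) * (wa // patch) + (x - la) // patch
            ib = ((y - tb) // patch) * (wb // patch) + (x - lb) // patch
            pairs.append((ia, ib))
    return pairs

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)

def patch_consistency_loss(emb_a, emb_b, pairs):
    """Mean cosine distance over matched patches (hypothetical loss form)."""
    if not pairs:
        return 0.0
    return sum(1.0 - cosine(emb_a[i], emb_b[j]) for i, j in pairs) / len(pairs)

# Two 32x32 crops offset by one patch share exactly one 16x16 patch.
pairs = overlap_pairs((0, 0, 32, 32), (16, 16, 32, 32))  # [(3, 0)]
emb_a = [[0.0, 1.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
emb_b = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0]]
loss = patch_consistency_loss(emb_a, emb_b, pairs)
```

In a real pipeline the embeddings would come from the ViT's patch tokens, and random resized crops would require mapping patch coordinates between views rather than the exact grid alignment assumed here.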
Pages: 14