Patch-Level Consistency Regularization in Self-Supervised Transfer Learning for Fine-Grained Image Recognition

Cited by: 0
Authors
Lee, Yejin [1 ]
Lee, Suho [1 ]
Hwang, Sangheum [1 ,2 ,3 ]
Affiliations
[1] Seoul Natl Univ Sci & Technol, Dept Data Sci, Seoul 01811, South Korea
[2] Seoul Natl Univ Sci & Technol, Dept Ind & Informat Syst Engn, Seoul 01811, South Korea
[3] Seoul Natl Univ Sci & Technol, Res Ctr Elect & Informat Technol, Seoul 01811, South Korea
Source
APPLIED SCIENCES-BASEL | 2023, Vol. 13, Issue 18
Funding
National Research Foundation of Singapore;
Keywords
self-supervised learning; fine-grained image recognition; transfer learning; Vision Transformer;
DOI
10.3390/app131810493
CLC Classification
O6 [Chemistry];
Subject Classification
0703;
Abstract
Fine-grained image recognition aims to classify fine subcategories belonging to the same parent category, such as vehicle model or bird species classification. This is an inherently challenging task because a classifier must capture subtle interclass differences under large intraclass variances. Most previous approaches are based on supervised learning, which requires a large-scale labeled dataset. However, such large-scale annotated datasets for fine-grained image recognition are difficult to collect because they generally require domain expertise during the labeling process. In this study, we propose a self-supervised transfer learning method based on Vision Transformer (ViT) to learn finer representations without human annotations. Interestingly, it is observed that existing self-supervised learning methods using ViT (e.g., DINO) show poor patch-level semantic consistency, which may be detrimental to learning finer representations. Motivated by this observation, we propose a consistency loss function that encourages patch embeddings of the overlapping area between two augmented views to be similar to each other during self-supervised learning on fine-grained datasets. In addition, we explore effective transfer learning strategies to fully leverage existing self-supervised models trained on large-scale labeled datasets. Contrary to the previous literature, our findings indicate that training only the last block of ViT is effective for self-supervised transfer learning. We demonstrate the effectiveness of our proposed approach through extensive experiments using six fine-grained image classification benchmark datasets, including FGVC Aircraft, CUB-200-2011, Food-101, Oxford 102 Flowers, Stanford Cars, and Stanford Dogs. Under the linear evaluation protocol, our method achieves an average accuracy of 78.5%, outperforming the existing transfer learning method, which yields 77.2%.
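The abstract describes a consistency loss that pulls together the patch embeddings of the region shared by two augmented views. A minimal sketch of that idea is below, assuming the patch embeddings inside the overlap have already been extracted and matched index-to-index; the function name and the choice of a mean (1 − cosine similarity) penalty are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def patch_consistency_loss(patches_a: torch.Tensor,
                           patches_b: torch.Tensor) -> torch.Tensor:
    """Illustrative patch-level consistency penalty.

    patches_a, patches_b: (N, D) embeddings of the N patches that fall
    inside the overlapping area of two augmented views, matched
    index-to-index (patch i of view A corresponds to patch i of view B).

    Returns a scalar: the mean of (1 - cosine similarity) over the N
    matched patch pairs, which is 0 when matched embeddings are
    identical up to scale and grows as they diverge.
    """
    sim = F.cosine_similarity(patches_a, patches_b, dim=-1)  # shape (N,)
    return (1.0 - sim).mean()
```

In practice this term would be added to the main self-supervised objective (e.g., a DINO-style loss) for views whose crops overlap; the paper's actual loss and patch-matching procedure may differ in detail.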
Pages: 14
Related Papers (50 in total)
  • [41] From WSI-level to patch-level: Structure prior-guided binuclear cell fine-grained detection
    Hu, Geng
    Wang, Baomin
    Hu, Boxian
    Chen, Dan
    Hu, Lihua
    Li, Cheng
    An, Yu
    Hu, Guiping
    Jia, Guang
    MEDICAL IMAGE ANALYSIS, 2023, 89
  • [42] Spatiotemporal consistency enhancement self-supervised representation learning for action recognition
    Bi, Shuai
    Hu, Zhengping
    Zhao, Mengyao
    Li, Shufang
    Sun, Zhe
    SIGNAL IMAGE AND VIDEO PROCESSING, 2023, 17 (04) : 1485 - 1492
  • [43] Consistency self-supervised learning method for robust automatic speech recognition
    Gao, Changfeng
    Cheng, Gaofeng
    Zhang, Pengyuan
    Shengxue Xuebao/Acta Acustica, 2023, 48 (03): : 578 - 587
  • [45] Image denoising for fluorescence microscopy by supervised to self-supervised transfer learning
    Wang, Yina
    Pinkard, Henry
    Khwaja, Emaad
    Zhou, Shuqin
    Waller, Laura
    Huang, Bo
    OPTICS EXPRESS, 2021, 29 (25) : 41303 - 41312
  • [46] Supervised Spatial Transformer Networks for Attention Learning in Fine-grained Action Recognition
    Liu, Dichao
    Wang, Yu
    Kato, Jien
    VISAPP: PROCEEDINGS OF THE 14TH INTERNATIONAL JOINT CONFERENCE ON COMPUTER VISION, IMAGING AND COMPUTER GRAPHICS THEORY AND APPLICATIONS, VOL 4, 2019, : 311 - 318
  • [47] Self-supervised multi-scale semantic consistency regularization for unsupervised image-to-image translation
    Zhang, Heng
    Yang, Yi-Jun
    Zeng, Wei
    COMPUTER VISION AND IMAGE UNDERSTANDING, 2024, 241
  • [48] Accuracy improvement for fine-grained image classification with semi-supervised learning
    Yu, Lei
    Cheng, Le
    Zhang, Jinli
    Zhu, Hongna
    Gao, Xiaorong
    2019 ASIA COMMUNICATIONS AND PHOTONICS CONFERENCE (ACP), 2019,
  • [49] Attention-based supervised contrastive learning on fine-grained image classification
    Li, Qian
    Wu, Weining
    PATTERN ANALYSIS AND APPLICATIONS, 2024, 27 (03)
  • [50] Object and attribute recognition for product image with self-supervised learning
    Dai, Yong
    Li, Yi
    Sun, Bin
    NEUROCOMPUTING, 2023, 558