Patch-Level Consistency Regularization in Self-Supervised Transfer Learning for Fine-Grained Image Recognition

Cited by: 0
Authors
Lee, Yejin [1 ]
Lee, Suho [1 ]
Hwang, Sangheum [1 ,2 ,3 ]
Affiliations
[1] Seoul Natl Univ Sci & Technol, Dept Data Sci, Seoul 01811, South Korea
[2] Seoul Natl Univ Sci & Technol, Dept Ind & Informat Syst Engn, Seoul 01811, South Korea
[3] Seoul Natl Univ Sci & Technol, Res Ctr Elect & Informat Technol, Seoul 01811, South Korea
Source
APPLIED SCIENCES-BASEL | 2023, Vol. 13, Issue 18
Funding
National Research Foundation of Singapore;
Keywords
self-supervised learning; fine-grained image recognition; transfer learning; Vision Transformer;
DOI
10.3390/app131810493
CLC Classification
O6 [Chemistry];
Subject Classification
0703;
Abstract
Fine-grained image recognition aims to classify fine subcategories belonging to the same parent category, such as vehicle model or bird species classification. This is an inherently challenging task because a classifier must capture subtle interclass differences under large intraclass variances. Most previous approaches are based on supervised learning, which requires a large-scale labeled dataset. However, such large-scale annotated datasets for fine-grained image recognition are difficult to collect because they generally require domain expertise during the labeling process. In this study, we propose a self-supervised transfer learning method based on Vision Transformer (ViT) to learn finer representations without human annotations. Interestingly, it is observed that existing self-supervised learning methods using ViT (e.g., DINO) show poor patch-level semantic consistency, which may be detrimental to learning finer representations. Motivated by this observation, we propose a consistency loss function that encourages patch embeddings of the overlapping area between two augmented views to be similar to each other during self-supervised learning on fine-grained datasets. In addition, we explore effective transfer learning strategies to fully leverage existing self-supervised models trained on large-scale labeled datasets. Contrary to the previous literature, our findings indicate that training only the last block of ViT is effective for self-supervised transfer learning. We demonstrate the effectiveness of our proposed approach through extensive experiments using six fine-grained image classification benchmark datasets, including FGVC Aircraft, CUB-200-2011, Food-101, Oxford 102 Flowers, Stanford Cars, and Stanford Dogs. Under the linear evaluation protocol, our method achieves an average accuracy of 78.5%, outperforming the existing transfer learning method, which yields 77.2%.
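The abstract describes a consistency loss that pulls together the patch embeddings of the region shared by two augmented views. A minimal sketch of that idea is below, assuming the patch embeddings inside the overlap have already been extracted and matched index-to-index; the function name and the choice of a mean (1 − cosine similarity) penalty are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def patch_consistency_loss(patches_a: torch.Tensor,
                           patches_b: torch.Tensor) -> torch.Tensor:
    """Illustrative patch-level consistency penalty.

    patches_a, patches_b: (N, D) embeddings of the N patches that fall
    inside the overlapping area of two augmented views, matched
    index-to-index (patch i of view A corresponds to patch i of view B).

    Returns a scalar: the mean of (1 - cosine similarity) over the N
    matched patch pairs, which is 0 when matched embeddings are
    identical up to scale and grows as they diverge.
    """
    sim = F.cosine_similarity(patches_a, patches_b, dim=-1)  # shape (N,)
    return (1.0 - sim).mean()
```

In practice this term would be added to the main self-supervised objective (e.g., a DINO-style loss) for views whose crops overlap; the paper's actual loss and patch-matching procedure may differ in detail.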
Pages: 14
Related Papers (50 in total)
  • [41] From WSI-level to patch-level: Structure prior-guided binuclear cell fine-grained detection
    Hu, Geng
    Wang, Baomin
    Hu, Boxian
    Chen, Dan
    Hu, Lihua
    Li, Cheng
    An, Yu
    Hu, Guiping
    Jia, Guang
    MEDICAL IMAGE ANALYSIS, 2023, 89
  • [42] Spatiotemporal consistency enhancement self-supervised representation learning for action recognition
    Bi, Shuai
    Hu, Zhengping
    Zhao, Mengyao
    Li, Shufang
    Sun, Zhe
    SIGNAL IMAGE AND VIDEO PROCESSING, 2023, 17 (04) : 1485 - 1492
  • [43] Consistency self-supervised learning method for robust automatic speech recognition
    Gao, Changfeng
    Cheng, Gaofeng
    Zhang, Pengyuan
    Shengxue Xuebao/Acta Acustica, 2023, 48 (03): : 578 - 587
  • [45] Image denoising for fluorescence microscopy by supervised to self-supervised transfer learning
    Wang, Yina
    Pinkard, Henry
    Khwaja, Emaad
    Zhou, Shuqin
    Waller, Laura
    Huang, Bo
    OPTICS EXPRESS, 2021, 29 (25) : 41303 - 41312
  • [46] Supervised Spatial Transformer Networks for Attention Learning in Fine-grained Action Recognition
    Liu, Dichao
    Wang, Yu
    Kato, Jien
    VISAPP: PROCEEDINGS OF THE 14TH INTERNATIONAL JOINT CONFERENCE ON COMPUTER VISION, IMAGING AND COMPUTER GRAPHICS THEORY AND APPLICATIONS, VOL 4, 2019, : 311 - 318
  • [47] Self-supervised multi-scale semantic consistency regularization for unsupervised image-to-image translation
    Zhang, Heng
    Yang, Yi-Jun
    Zeng, Wei
    COMPUTER VISION AND IMAGE UNDERSTANDING, 2024, 241
  • [48] Accuracy improvement for fine-grained image classification with semi-supervised learning
    Yu, Lei
    Cheng, Le
    Zhang, Jinli
    Zhu, Hongna
    Gao, Xiaorong
    2019 ASIA COMMUNICATIONS AND PHOTONICS CONFERENCE (ACP), 2019,
  • [49] Attention-based supervised contrastive learning on fine-grained image classification
    Li, Qian
    Wu, Weining
    PATTERN ANALYSIS AND APPLICATIONS, 2024, 27 (03)
  • [50] Object and attribute recognition for product image with self-supervised learning
    Dai, Yong
    Li, Yi
    Sun, Bin
    NEUROCOMPUTING, 2023, 558