A Hierarchical Vision Transformer Using Overlapping Patch and Self-Supervised Learning

Cited: 0
Authors
Ma, Yaxin [1 ]
Li, Ming [1 ]
Chang, Jun [2 ]
Affiliations
[1] Wuhan Univ, Natl Engn Res Ctr Multimedia Software, Sch Comp Sci, Wuhan, Peoples R China
[2] Wuhan Univ, Sch Comp Sci, Wuhan, Peoples R China
Keywords
Computer Vision; Image Classification; Vision Transformer; Self-supervised learning;
DOI
10.1109/IJCNN54540.2023.10191916
CLC Classification
TP18 [Theory of Artificial Intelligence];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Transformer-based network architectures have gradually replaced convolutional neural networks (CNNs) in computer vision. Compared with CNNs, Transformers can learn global information from images and have stronger feature-extraction capability. However, because they lack inductive bias, vision Transformers such as ViT require a large amount of data for pre-training. Locality-based Transformers effectively reduce computational complexity, but they cannot establish long-range dependencies and perform worse on small-scale datasets. To address these problems, the OPSe Transformer is proposed. A global attention module is added after each stage of the vision Transformer; it uses slightly larger, overlapping key and value patches to enhance the exchange of information between adjacent windows and to aggregate global information in the local Transformer. In addition, a self-supervised proxy task is added to the architecture, with a corresponding loss function that constrains training on the dataset, so that the vision Transformer learns spatial information within an image and trains more effectively. Comparative experiments on tiny-ImageNet, CIFAR-10/100, and other datasets show that, compared with the baseline algorithm, our model improves accuracy by up to 3.91%.
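The core idea of the abstract, attention in which queries come from non-overlapping local windows while keys and values come from slightly larger, overlapping windows, can be sketched as follows. This is a minimal single-head illustration, not the authors' exact OPSe module: the window size, overlap width, and zero-padding at the borders are assumptions made for the sketch.

```python
import numpy as np

def overlapping_window_attention(x, win=4, overlap=1):
    """Toy sketch: queries from non-overlapping win x win windows,
    keys/values from enlarged (win + 2*overlap) windows, so adjacent
    windows exchange information at their borders.

    x: (H, W, C) feature map with H, W divisible by `win`.
    Parameter names and defaults are illustrative assumptions.
    """
    H, W, C = x.shape
    # Zero-pad so every query window can see `overlap` extra rows/cols
    # on each side (border handling is an assumption of this sketch).
    xp = np.pad(x, ((overlap, overlap), (overlap, overlap), (0, 0)))
    kwin = win + 2 * overlap          # enlarged key/value window side
    out = np.zeros_like(x)
    for i in range(0, H, win):
        for j in range(0, W, win):
            q = x[i:i + win, j:j + win].reshape(-1, C)       # (win^2, C)
            # In padded coordinates, [i, i+kwin) covers original
            # [i-overlap, i+win+overlap): the overlapping K/V patch.
            kv = xp[i:i + kwin, j:j + kwin].reshape(-1, C)   # (kwin^2, C)
            attn = q @ kv.T / np.sqrt(C)                     # scaled dot-product
            attn = np.exp(attn - attn.max(axis=1, keepdims=True))
            attn /= attn.sum(axis=1, keepdims=True)          # row-wise softmax
            out[i:i + win, j:j + win] = (attn @ kv).reshape(win, win, C)
    return out
```

Because each enlarged key/value patch extends `overlap` pixels into the neighboring windows, tokens near a window border attend across the partition boundary, which is the information-exchange effect the abstract describes; the learned projections, multi-head structure, and self-supervised proxy loss of the actual model are omitted here.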
Pages: 7