A Hierarchical Vision Transformer Using Overlapping Patch and Self-Supervised Learning

Cited: 0
Authors
Ma, Yaxin [1 ]
Li, Ming [1 ]
Chang, Jun [2 ]
Affiliations
[1] Wuhan Univ, Natl Engn Res Ctr Multimedia Software, Sch Comp Sci, Wuhan, Peoples R China
[2] Wuhan Univ, Sch Comp Sci, Wuhan, Peoples R China
Keywords
Computer Vision; Image Classification; Vision Transformer; Self-supervised learning;
DOI
10.1109/IJCNN54540.2023.10191916
CLC Classification
TP18 [Theory of Artificial Intelligence];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Transformer-based network architectures have gradually replaced convolutional neural networks (CNNs) in computer vision. Compared with CNNs, Transformers can learn global information from images and have stronger feature-extraction capability. However, because they lack inductive bias, vision Transformers such as ViT require a large amount of data for pre-training. Locality-based Transformers effectively reduce computational complexity, but they cannot establish long-range dependencies and perform worse on small-scale datasets. To address these problems, the OPSe Transformer is proposed. A global attention module is added after each stage of the vision Transformer; it uses slightly larger, overlapping key and value patches to enhance the exchange of information between adjacent windows and to aggregate global information in the local Transformer. In addition, a self-supervised proxy task is added to the architecture, with a corresponding loss function that constrains training on the dataset, so that the vision Transformer learns spatial information within an image and trains more effectively. Comparative experiments on tiny-ImageNet, CIFAR-10/100, and other datasets show that, compared with the baseline algorithm, our model improves accuracy by up to 3.91%.
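The core idea of the abstract, attention in which queries come from non-overlapping local windows while keys and values come from slightly larger, overlapping windows, can be sketched as follows. This is a minimal single-head illustration, not the authors' exact OPSe module: the window size, overlap width, and zero-padding at the borders are assumptions made for the sketch.

```python
import numpy as np

def overlapping_window_attention(x, win=4, overlap=1):
    """Toy sketch: queries from non-overlapping win x win windows,
    keys/values from enlarged (win + 2*overlap) windows, so adjacent
    windows exchange information at their borders.

    x: (H, W, C) feature map with H, W divisible by `win`.
    Parameter names and defaults are illustrative assumptions.
    """
    H, W, C = x.shape
    # Zero-pad so every query window can see `overlap` extra rows/cols
    # on each side (border handling is an assumption of this sketch).
    xp = np.pad(x, ((overlap, overlap), (overlap, overlap), (0, 0)))
    kwin = win + 2 * overlap          # enlarged key/value window side
    out = np.zeros_like(x)
    for i in range(0, H, win):
        for j in range(0, W, win):
            q = x[i:i + win, j:j + win].reshape(-1, C)       # (win^2, C)
            # In padded coordinates, [i, i+kwin) covers original
            # [i-overlap, i+win+overlap): the overlapping K/V patch.
            kv = xp[i:i + kwin, j:j + kwin].reshape(-1, C)   # (kwin^2, C)
            attn = q @ kv.T / np.sqrt(C)                     # scaled dot-product
            attn = np.exp(attn - attn.max(axis=1, keepdims=True))
            attn /= attn.sum(axis=1, keepdims=True)          # row-wise softmax
            out[i:i + win, j:j + win] = (attn @ kv).reshape(win, win, C)
    return out
```

Because each enlarged key/value patch extends `overlap` pixels into the neighboring windows, tokens near a window border attend across the partition boundary, which is the information-exchange effect the abstract describes; the learned projections, multi-head structure, and self-supervised proxy loss of the actual model are omitted here.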
Pages: 7