Adaptive Hybrid Vision Transformer for Small Datasets

Cited by: 0
Authors
Yin, Mingjun [1]
Chang, Zhiyong [2]
Wang, Yan [3]
Affiliations
[1] Univ Melbourne, Melbourne, Vic, Australia
[2] Peking Univ, Beijing, Peoples R China
[3] Xiaochuan Chuhai, Beijing, Peoples R China
Keywords
Vision Transformer; Small Dataset; Self-Attention;
DOI
10.1109/ICTAI59109.2023.00132
Chinese Library Classification number
TP18 [Artificial Intelligence Theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Recently, vision Transformers (ViTs) have achieved competitive performance on many computer vision tasks. However, when trained from scratch on small datasets, vision Transformers underperform Convolutional Neural Networks (CNNs), which is attributed to their lack of a locality inductive bias. This impedes the application of vision Transformers to small datasets. In this work, we propose the Adaptive Hybrid Vision Transformer (AHVT) to boost the performance of vision Transformers on small-scale datasets. Specifically, on the spatial dimension, we use a Convolutional Overlapping Patch Embedding (COPE) layer to inject the desired inductive bias into the model, forcing it to learn local token features. On the channel dimension, we insert an adaptive channel feature aggregation block into the vanilla feed-forward network to calibrate channel responses. Meanwhile, we add several extra learnable "cardinality tokens" to the patch token sequences to capture cross-channel interaction. We present extensive experiments to validate the effectiveness of our method on five small and medium datasets: CIFAR10/100, SVHN, Tiny-ImageNet and ImageNet-1k. Our approach attains state-of-the-art performance on the four small datasets above when training from scratch.
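
The overlapping patch idea mentioned in the abstract can be illustrated with a small sketch. This is not the authors' COPE layer: the abstract gives no kernel size, stride, or padding, so the values below are hypothetical, and the helper names `conv_output_size` and `extract_patches` are invented for illustration. The sketch only shows that a strided window larger than its stride yields the same number of tokens as non-overlapping patching while letting neighbouring tokens share pixels.

```python
def conv_output_size(size, kernel, stride, padding=0):
    """Number of output positions along one axis for a standard convolution."""
    return (size + 2 * padding - kernel) // stride + 1


def extract_patches(image, kernel, stride):
    """Return the square patches visited by a kernel x kernel window (no padding)."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height - kernel + 1, stride):
        for left in range(0, width - kernel + 1, stride):
            patch = [row[left:left + kernel] for row in image[top:top + kernel]]
            patches.append(patch)
    return patches


# Hypothetical settings for a 32x32 input (CIFAR-sized); not taken from the paper.
# Non-overlapping: kernel equals stride, so each pixel belongs to one token.
non_overlapping = conv_output_size(32, kernel=4, stride=4) ** 2
# Overlapping: kernel larger than stride, padded so the token grid stays the same.
overlapping = conv_output_size(32, kernel=7, stride=4, padding=3) ** 2

# A 4x4 toy image: with kernel 2 and stride 1, neighbouring patches share pixels.
toy = [[r * 4 + c for c in range(4)] for r in range(4)]
toy_patches = extract_patches(toy, kernel=2, stride=1)
```

Under these assumed settings both schemes produce the same token grid, and in the toy example the first two patches share a column of pixels, which is the kind of local overlap a convolutional embedding provides.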
Pages: 873-880
Number of pages: 8