CRViT: Vision transformer advanced by causality and inductive bias for image recognition
F. Lu et al.

Cited by: 0
Authors
Lu, Faming [1 ]
Jia, Kunhao [1 ]
Zhang, Xue [1 ]
Sun, Lin [2 ]
Affiliations
[1] College of Computer Science and Engineering, Shandong University of Science and Technology, 579 Qianwan Port Road, Qingdao 266590, China
[2] College of Geodesy and Geomatics, Shandong University of Science and Technology, 579 Qianwan Port Road, Qingdao 266590, China
Funding
National Natural Science Foundation of China;
Keywords
Convolutional neural networks;
DOI
10.1007/s10489-024-05910-3
Abstract
Vision Transformer (ViT) has shown powerful potential in various vision tasks by exploiting the Transformer's self-attention mechanism and global perception capability. However, training its large number of network parameters requires huge amounts of data and computational resources, so ViT performs poorly on small and medium-sized datasets. Compared to ViT, convolutional networks maintain high accuracy even with little data because they incorporate inductive bias (IB). Moreover, causal relationships can uncover the underlying correlations in data structures, making deep learning networks more intelligent. In this work, we propose the Causal Relationship Vision Transformer (CRViT), which refines ViT by fusing causal relationships and IB. We propose a random Fourier features module that makes feature vectors independent of each other, and we use convolution to learn the correct correlations between feature vectors and extract causal features, thereby introducing causal relationships into our network. A convolutional downsampling structure significantly reduces the number of parameters of our model while introducing IB. Experimental validations underscore the data efficiency of CRViT: it achieves a Top-1 accuracy of 80.6% on the ImageNet-1k dataset, surpassing the ViT benchmark by 2.7% while reducing parameters by 92%. This enhanced performance is consistent across smaller datasets as well, including T-ImageNet, CIFAR, and SVHN. We also create the counterfactual dataset Colorful MNIST and experimentally demonstrate that causality is genuinely incorporated. © The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2024.
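The abstract's "random Fourier features module" presumably builds on the standard Rahimi–Recht random Fourier feature mapping, which projects inputs through random frequencies so that inner products approximate an RBF kernel. The sketch below shows that standard construction only; the exact form used inside CRViT is not specified in this record, and the function name and parameters here are illustrative assumptions.

```python
import numpy as np

def random_fourier_features(x, n_features=64, gamma=1.0, seed=0):
    """Map inputs to a random feature space whose inner products
    approximate the RBF kernel k(x, y) = exp(-gamma * ||x - y||^2).

    Standard Rahimi-Recht construction; CRViT's actual module may differ.
    """
    rng = np.random.default_rng(seed)
    d = x.shape[-1]
    # Frequencies sampled from the Fourier transform of the RBF kernel:
    # w ~ N(0, 2 * gamma * I)
    w = rng.normal(scale=np.sqrt(2.0 * gamma), size=(d, n_features))
    # Random phase offsets, uniform on [0, 2*pi)
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(x @ w + b)

# Inner products in the feature space approximate the exact RBF kernel
x = np.array([[0.1, 0.2], [0.3, 0.1]])
z = random_fourier_features(x, n_features=2048)
approx = z[0] @ z[1]
exact = np.exp(-1.0 * np.sum((x[0] - x[1]) ** 2))
```

With enough random features the approximation error shrinks as O(1/sqrt(n_features)), which is why such a mapping can serve as a cheap, randomized decorrelation step before learning correlations with convolution.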
Related Papers
3 results
  • [1] ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for Image Recognition and Beyond
    Zhang, Qiming
    Xu, Yufei
    Zhang, Jing
    Tao, Dacheng
    INTERNATIONAL JOURNAL OF COMPUTER VISION, 2023, 131 (05) : 1141 - 1162
  • [2] ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias
    Xu, Yufei
    Zhang, Qiming
    Zhang, Jing
    Tao, Dacheng
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 34 (NEURIPS 2021), 2021
  • [3] A swin-transformer-based network with inductive bias ability for medical image segmentation
    Gao, Yan
    Xu, Huan
    Liu, Quanle
    Bie, Mei
    Che, Xiangjiu
    APPLIED INTELLIGENCE, 2025, 55 (2)