CRViT: Vision transformer advanced by causality and inductive bias for image recognition
F. Lu et al.

Cited by: 0
Authors
Lu, Faming [1 ]
Jia, Kunhao [1 ]
Zhang, Xue [1 ]
Sun, Lin [2 ]
Affiliations
[1] College of Computer Science and Engineering, Shandong University of Science and Technology, 579 Qianwan Port Road, Qingdao 266590, China
[2] College of Geodesy and Geomatics, Shandong University of Science and Technology, 579 Qianwan Port Road, Qingdao 266590, China
Funding
National Natural Science Foundation of China;
Keywords
Convolutional neural networks;
DOI
10.1007/s10489-024-05910-3
Abstract
Vision Transformer (ViT) has shown powerful potential in various vision tasks by exploiting the Transformer's self-attention mechanism and global perception capability. However, training its large number of network parameters requires huge amounts of data and computational resources, so ViT performs poorly on small and medium-sized datasets. Compared to ViT, convolutional networks maintain high accuracy even with little data because they incorporate inductive bias (IB). Moreover, causal relationships can uncover the underlying correlations in data structures, making deep learning networks more intelligent. In this work, we propose the Causal Relationship Vision Transformer (CRViT), which refines ViT by fusing causal relationships and IB. We propose a random Fourier features module that makes feature vectors independent of each other, and we use convolution to learn the correct correlations between feature vectors and extract causal features, thereby introducing causal relationships into our network. A convolutional downsampling structure significantly reduces the number of parameters of our model while introducing IB. Experimental validations underscore the data efficiency of CRViT: it achieves a Top-1 accuracy of 80.6% on the ImageNet-1k dataset, surpassing the ViT benchmark by 2.7% while reducing parameters by 92%. This enhanced performance is consistent across smaller datasets as well, including T-ImageNet, CIFAR, and SVHN. We also create the counterfactual dataset Colorful MNIST and experimentally demonstrate that causality is genuinely incorporated. © The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2024.
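The abstract's "random Fourier features module" presumably builds on the standard Rahimi–Recht random Fourier feature mapping, which projects inputs through random frequencies so that inner products approximate an RBF kernel. The sketch below shows that standard construction only; the exact form used inside CRViT is not specified in this record, and the function name and parameters here are illustrative assumptions.

```python
import numpy as np

def random_fourier_features(x, n_features=64, gamma=1.0, seed=0):
    """Map inputs to a random feature space whose inner products
    approximate the RBF kernel k(x, y) = exp(-gamma * ||x - y||^2).

    Standard Rahimi-Recht construction; CRViT's actual module may differ.
    """
    rng = np.random.default_rng(seed)
    d = x.shape[-1]
    # Frequencies sampled from the Fourier transform of the RBF kernel:
    # w ~ N(0, 2 * gamma * I)
    w = rng.normal(scale=np.sqrt(2.0 * gamma), size=(d, n_features))
    # Random phase offsets, uniform on [0, 2*pi)
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(x @ w + b)

# Inner products in the feature space approximate the exact RBF kernel
x = np.array([[0.1, 0.2], [0.3, 0.1]])
z = random_fourier_features(x, n_features=2048)
approx = z[0] @ z[1]
exact = np.exp(-1.0 * np.sum((x[0] - x[1]) ** 2))
```

With enough random features the approximation error shrinks as O(1/sqrt(n_features)), which is why such a mapping can serve as a cheap, randomized decorrelation step before learning correlations with convolution.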
Related Papers
3 results
  • [1] ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for Image Recognition and Beyond
    Zhang, Qiming
    Xu, Yufei
    Zhang, Jing
    Tao, Dacheng
    INTERNATIONAL JOURNAL OF COMPUTER VISION, 2023, 131 (05) : 1141 - 1162
  • [2] ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias
    Xu, Yufei
    Zhang, Qiming
    Zhang, Jing
    Tao, Dacheng
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 34 (NEURIPS 2021), 2021
  • [3] A swin-transformer-based network with inductive bias ability for medical image segmentation
    Gao, Yan
    Xu, Huan
    Liu, Quanle
    Bie, Mei
    Che, Xiangjiu
    APPLIED INTELLIGENCE, 2025, 55 (2)