Vision Transformers have demonstrated strong advantages over convolutional neural networks across the computer vision community. Nevertheless, our understanding of multi-head self-attention, the de facto core ingredient of Transformers, remains limited, which has spurred growing interest in explaining its underlying mechanism. A notable theory holds that, unlike convolutions, which are sensitive to high frequencies, self-attention behaves as a generalized spatial smoothing operator and increasingly blurs high spatial-frequency signals as depth grows. In this paper, we design a Conv-MSA structure that efficiently extracts local contextual information and remedies this inherent drawback of self-attention. Building on it, we propose FlexFormer, a flexible Transformer architecture whose computational complexity is linear in the input image size. Experimental results on several visual recognition benchmarks show that FlexFormer achieves state-of-the-art results with fewer parameters and higher computational efficiency.
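
To make the Conv-MSA idea concrete, the following is a minimal sketch of how a convolutional local branch might be paired with multi-head self-attention inside one block. It is not the paper's implementation: the class and parameter names (ConvMSABlock, dim, num_heads, kernel_size) are illustrative assumptions, and the attention used here is standard quadratic-cost self-attention rather than the linear-complexity design the abstract describes.

# Minimal sketch, assuming a depthwise-conv local branch fused with
# standard multi-head self-attention; not the paper's actual Conv-MSA.
import torch
import torch.nn as nn


class ConvMSABlock(nn.Module):
    """Hypothetical block: a depthwise convolution preserves high-frequency
    local detail, while multi-head self-attention aggregates global context."""

    def __init__(self, dim: int, num_heads: int = 4, kernel_size: int = 3):
        super().__init__()
        # Local branch: depthwise convolution over the 2D token grid.
        self.local_conv = nn.Conv2d(
            dim, dim, kernel_size, padding=kernel_size // 2, groups=dim
        )
        # Global branch: multi-head self-attention over flattened tokens.
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) feature map.
        b, c, h, w = x.shape
        local = self.local_conv(x)                     # high-frequency local context
        tokens = self.norm(x.flatten(2).transpose(1, 2))  # (B, H*W, C)
        global_ctx, _ = self.attn(tokens, tokens, tokens)
        global_ctx = global_ctx.transpose(1, 2).reshape(b, c, h, w)
        return x + local + global_ctx                  # residual fusion of both branches


if __name__ == "__main__":
    block = ConvMSABlock(dim=64, num_heads=4)
    out = block(torch.randn(2, 64, 14, 14))
    print(out.shape)  # torch.Size([2, 64, 14, 14])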