In infrared and visible image fusion, recently proposed methods predominantly employ self-attention mechanisms to explore long-range dependencies among image features, aiming to mitigate the loss of global relationships. However, these methods focus primarily on capturing dependencies within each modality and pay little attention to cross-modal interaction, leading to unsatisfactory global contrast in the fusion results. Furthermore, the weak inductive bias of the self-attention mechanism constrains its ability to capture local features, potentially causing a loss of detail and texture in the fused image. In this article, we explore a simple yet effective network structure that models long-range dependencies in an equivalent manner and propose a cross-modal global feature mixing network called CMFuse. Specifically, we propose intra- and inter-modality mixing modules (Intra-Conv-MLP and Inter-Conv-MLP), which consist of residual multilayer perceptrons (MLPs) and depthwise separable convolutions. These modules are designed to extract and integrate complementary information within and between modalities, leveraging the global receptive field of the MLP. Moreover, multiple residual dense blocks (RDBs) are employed to strengthen the network's ability to extract local fine-grained features, thereby enriching textures in the fused images. Extensive experiments demonstrate that CMFuse outperforms existing state-of-the-art methods. Furthermore, our model significantly enhances the performance of high-level vision tasks. Our code and pre-trained model will be published at https://github.com/zc617/Conv-MLP-Fusion.
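To make the intra- and inter-modality mixing idea concrete, the following is a minimal PyTorch sketch: a 3x3 depthwise separable convolution supplies local mixing, a residual spatial MLP over flattened positions provides a global receptive field, and cross-modal interaction is modeled by mixing concatenated infrared and visible features. The module names follow the abstract, but every layer ordering, channel size, and hyperparameter here is an illustrative assumption, not the paper's actual implementation.

```python
import torch
import torch.nn as nn


class DepthwiseSeparableConv(nn.Module):
    """Depthwise 3x3 convolution followed by a pointwise 1x1 convolution."""
    def __init__(self, channels):
        super().__init__()
        self.depthwise = nn.Conv2d(channels, channels, 3, padding=1, groups=channels)
        self.pointwise = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))


class IntraConvMLP(nn.Module):
    """Intra-modality mixing (assumed form): depthwise separable conv for local
    features plus a residual spatial MLP whose receptive field spans the map."""
    def __init__(self, channels, num_tokens, hidden_ratio=2):
        super().__init__()
        self.conv = DepthwiseSeparableConv(channels)
        self.spatial_mlp = nn.Sequential(
            nn.Linear(num_tokens, num_tokens * hidden_ratio),
            nn.GELU(),
            nn.Linear(num_tokens * hidden_ratio, num_tokens),
        )

    def forward(self, x):
        b, c, h, w = x.shape
        x = x + self.conv(x)                        # local spatial mixing, residual
        tokens = x.flatten(2)                       # (B, C, H*W)
        tokens = tokens + self.spatial_mlp(tokens)  # global mixing across positions
        return tokens.view(b, c, h, w)


class InterConvMLP(nn.Module):
    """Inter-modality mixing (assumed form): fuse concatenated infrared and
    visible features with a residual channel MLP (1x1 convolutions)."""
    def __init__(self, channels, hidden_ratio=2):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * channels, channels * hidden_ratio, 1),
            nn.GELU(),
            nn.Conv2d(channels * hidden_ratio, channels, 1),
        )

    def forward(self, ir_feat, vis_feat):
        mixed = self.fuse(torch.cat([ir_feat, vis_feat], dim=1))
        # residual connections keep each modality's own information
        return ir_feat + mixed, vis_feat + mixed


if __name__ == "__main__":
    ir = torch.randn(1, 32, 32, 32)
    vis = torch.randn(1, 32, 32, 32)
    intra = IntraConvMLP(channels=32, num_tokens=32 * 32)
    inter = InterConvMLP(channels=32)
    ir_out, vis_out = inter(intra(ir), intra(vis))
    print(ir_out.shape, vis_out.shape)  # torch.Size([1, 32, 32, 32]) each
```

In this sketch the residual connections around both the convolution and the MLP mirror the "residual MLP plus depthwise separable convolution" composition described above; the residual dense blocks (RDBs) used for fine-grained texture extraction are omitted for brevity.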