The fusion of infrared and visible images aims to merge their complementary information to generate a fused output with better visual perception and scene understanding. Existing CNN-based methods typically employ convolutional operations to extract local features but fail to model long-range relationships. In contrast, Transformer-based methods usually adopt a self-attention mechanism to model global dependencies, yet lack complementary local information. More importantly, these methods often ignore the specialized interactive information learning between different modalities, which limits fusion performance. To address these issues, this paper introduces an infrared and visible image fusion method via interactive self-attention, namely ISAFusion. First, we devise a collaborative learning scheme that seamlessly integrates CNN and Transformer. This scheme leverages residual convolutional blocks to extract local features, which are then aggregated into the Transformer to model global features, thereby enhancing its feature representation ability. Second, we construct a cross-modality interactive attention module, a cascade of Token-ViT and Channel-ViT. This module models long-range dependencies along the token and channel dimensions in an interactive manner, allowing feature communication between spatial locations and independent channels. The generated global features focus on the intrinsic characteristics of the different modality images, which effectively strengthens their complementary information and achieves better fusion performance. Finally, we train the fusion network end-to-end with a comprehensive objective function comprising a structural similarity index measure (SSIM) loss, a gradient loss, and an intensity loss. This design ensures that the fusion model preserves similar structural information, valuable pixel intensities, and rich texture details from the source images.

To verify the effectiveness and superiority of the proposed method, we carry out experiments on three different benchmarks, namely the TNO, Roadscene, and M3FD datasets. Seven representative methods, namely U2Fusion, RFN-Nest, FusionGAN, GANMcC, YDTR, SwinFusion, and SwinFuse, are selected for the experimental comparisons. Eight evaluation metrics, namely average gradient, mutual information, phase congruency, feature mutual information with pixel, edge-based similarity measurement, gradient-based similarity measurement, multi-scale structural similarity index measure, and visual information fidelity, are used for the objective evaluation. In the comparative experiments, ISAFusion achieves more balanced fusion results, retaining the typical targets of the infrared image and the rich texture details of the visible image, which yields a better visual effect and is more suitable for the human visual system. From the objective comparison perspective, ISAFusion achieves better fusion performance than the compared methods on all three datasets, which is consistent with the subjective analysis. Furthermore, we evaluate the operational efficiency of the different methods, and the results show that our method is second only to YDTR, indicating its competitive computational efficiency. In summary, compared with the seven state-of-the-art competitors, our method presents better image fusion performance, stronger robustness, and higher computational efficiency.
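To make the two key components above more concrete, the following minimal PyTorch sketch illustrates one possible reading of the cross-modality interactive attention module, i.e., a token-wise attention stage followed by a channel-wise attention stage, with queries taken from one modality and keys/values from the other. It is not the authors' implementation; the projection layers, normalization placement, and residual connections are illustrative assumptions.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class InteractiveCrossAttention(nn.Module):
        # Illustrative cascade of token-wise and channel-wise cross-attention:
        # queries come from one modality (x), keys/values from the other (y).
        def __init__(self, dim, heads=8):
            super().__init__()
            self.norm_x = nn.LayerNorm(dim)
            self.norm_y = nn.LayerNorm(dim)
            self.token_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.q_proj = nn.Linear(dim, dim)
            self.kv_proj = nn.Linear(dim, 2 * dim)

        def forward(self, x, y):
            # x, y: (B, N, C) token sequences from the two modalities.
            xn, yn = self.norm_x(x), self.norm_y(y)
            # 1) Token-ViT stage: attention over spatial tokens across modalities.
            t, _ = self.token_attn(xn, yn, yn)
            x = x + t
            # 2) Channel-ViT stage: attention over channels, i.e. a (C x C) map
            #    computed on transposed sequences, so independent channels interact.
            q = self.q_proj(self.norm_x(x)).transpose(1, 2)           # (B, C, N)
            k, v = self.kv_proj(yn).chunk(2, dim=-1)
            k, v = k.transpose(1, 2), v.transpose(1, 2)               # (B, C, N)
            attn = F.softmax(q @ k.transpose(1, 2) / q.shape[-1] ** 0.5, dim=-1)
            x = x + (attn @ v).transpose(1, 2)                        # back to (B, N, C)
            return x

Similarly, the composite training objective can be sketched as a weighted sum of SSIM, gradient, and intensity terms. The weights and the max-based targets below are assumptions rather than the paper's settings, and ssim_fn stands for any differentiable SSIM implementation (e.g., from the pytorch_msssim package).

    import torch
    import torch.nn.functional as F

    def sobel_gradient(img):
        # Approximate image gradients with fixed Sobel kernels (single-channel input).
        kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]],
                          device=img.device).view(1, 1, 3, 3)
        ky = kx.transpose(2, 3)
        return F.conv2d(img, kx, padding=1).abs() + F.conv2d(img, ky, padding=1).abs()

    def fusion_loss(fused, ir, vis, ssim_fn, w_ssim=1.0, w_grad=10.0, w_int=10.0):
        # SSIM loss: keep the fused image structurally similar to both sources.
        loss_ssim = (1 - ssim_fn(fused, ir)) + (1 - ssim_fn(fused, vis))
        # Gradient loss: follow the stronger edge/texture response of the two sources.
        loss_grad = F.l1_loss(sobel_gradient(fused),
                              torch.max(sobel_gradient(ir), sobel_gradient(vis)))
        # Intensity loss: retain salient pixel intensities (e.g., bright IR targets).
        loss_int = F.l1_loss(fused, torch.max(ir, vis))
        return w_ssim * loss_ssim + w_grad * loss_grad + w_int * loss_int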
In addition, we carry out ablation experiments to verify the effectiveness of each designed component. The experimental results indicate that removing any of the designed components degrades the fusion performance to varying degrees. Interestingly, we also find that discarding the position embedding has a positive effect on the fusion performance. The qualitative and quantitative ablation studies demonstrate the rationality and superiority of each designed component. In the future, we will explore a more effective CNN-Transformer learning scheme to further improve fusion performance, and extend it to other fusion tasks, such as multi-band, multi-exposure, and multi-focus image fusion.