To achieve a fused image that contains rich texture details and prominent targets, we present PDFusion, a progressive dual-branch infrared and visible image fusion network that incorporates a Transformer module. The network first splits into two branches that extract infrared and visible features independently. An image-wise transfer block (ITB) is then introduced to fuse the infrared and visible features at different layers, enabling information exchange between the two branches; the fused features are fed back into both pathways to inform subsequent feature extraction. Moreover, in addition to conventional pixel-level and structural loss functions, a contrastive language-image pretraining (CLIP) loss is introduced to guide network training. Experimental results on publicly available datasets demonstrate the promising performance of PDFusion on the infrared and visible image fusion task. We attribute this fusion performance to two factors: (1) The ITB, particularly through its integrated Transformer, strengthens representation learning. The Transformer module captures long-range dependencies among image features, providing a global receptive field that integrates contextual information from the entire image and thus yields a more comprehensive fusion of features. (2) The feature loss based on the CLIP image encoder minimizes the discrepancy between the generated and target images, promoting semantically coherent and visually appealing fused results. The source code of our method is available at https://github.com/Changfei-Zhou/PDFusion.
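The internals of the ITB are not reproduced in this section, so the following is only a minimal PyTorch sketch of the idea under stated assumptions: cross-attention between the two branches and residual feedback into both pathways are illustrative choices, and the class name `ITBSketch` and all hyperparameters are hypothetical rather than taken from the released code.

```python
import torch
import torch.nn as nn


class ITBSketch(nn.Module):
    """Illustrative Transformer-based exchange between IR and visible features.

    Each branch attends to the other via multi-head cross-attention; the
    fused features are returned to both pathways as residual updates, so
    later layers of both branches see information from both modalities.
    """

    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        self.norm_ir = nn.LayerNorm(channels)
        self.norm_vis = nn.LayerNorm(channels)
        # Cross-attention: queries from one branch, keys/values from the other.
        self.attn_ir = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.attn_vis = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.fuse = nn.Linear(2 * channels, channels)

    def forward(self, f_ir: torch.Tensor, f_vis: torch.Tensor):
        # f_ir, f_vis: (B, C, H, W) feature maps from the two branches.
        b, c, h, w = f_ir.shape
        ir = f_ir.flatten(2).transpose(1, 2)    # (B, H*W, C) token sequence
        vis = f_vis.flatten(2).transpose(1, 2)
        ir_n, vis_n = self.norm_ir(ir), self.norm_vis(vis)
        # Infrared tokens query visible tokens and vice versa, giving each
        # branch a global receptive field over the other modality.
        ir2vis, _ = self.attn_ir(ir_n, vis_n, vis_n)
        vis2ir, _ = self.attn_vis(vis_n, ir_n, ir_n)
        fused = self.fuse(torch.cat([ir2vis, vis2ir], dim=-1))
        # Feed the fused features back into both pathways as residuals.
        ir_out = (ir + fused).transpose(1, 2).reshape(b, c, h, w)
        vis_out = (vis + fused).transpose(1, 2).reshape(b, c, h, w)
        return ir_out, vis_out


if __name__ == "__main__":
    itb = ITBSketch(channels=64)
    out_ir, out_vis = itb(torch.randn(2, 64, 32, 32), torch.randn(2, 64, 32, 32))
    print(out_ir.shape, out_vis.shape)  # torch.Size([2, 64, 32, 32]) twice
```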
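The CLIP feature loss can likewise be sketched as below. This version assumes OpenAI's `clip` package, a ViT-B/32 backbone, and a cosine-distance discrepancy between image embeddings; none of these specifics are stated above, and the class name `CLIPFeatureLoss` is hypothetical. The text says only that a loss based on the CLIP image encoder minimizes the discrepancy between the generated and target images.

```python
import torch
import torch.nn.functional as F
import clip  # pip install git+https://github.com/openai/CLIP.git


class CLIPFeatureLoss(torch.nn.Module):
    """Illustrative feature loss: 1 - cosine similarity of CLIP embeddings."""

    def __init__(self, device: str = "cpu"):
        super().__init__()
        # Backbone choice (ViT-B/32) is an assumption, not the paper's spec.
        self.model, _ = clip.load("ViT-B/32", device=device)
        self.model.eval()
        for p in self.model.parameters():
            p.requires_grad_(False)  # encoder stays frozen; gradients flow to the input
        # CLIP's expected input normalization (RGB, 224x224).
        mean = torch.tensor([0.48145466, 0.4578275, 0.40821073]).view(1, 3, 1, 1)
        std = torch.tensor([0.26862954, 0.26130258, 0.27577711]).view(1, 3, 1, 1)
        self.register_buffer("mean", mean.to(device))
        self.register_buffer("std", std.to(device))

    def _encode(self, img: torch.Tensor) -> torch.Tensor:
        # img: (B, 1 or 3, H, W) in [0, 1]; grayscale fusion outputs are
        # repeated to three channels to match CLIP's expected input.
        if img.shape[1] == 1:
            img = img.repeat(1, 3, 1, 1)
        img = F.interpolate(img, size=(224, 224), mode="bicubic", align_corners=False)
        img = (img - self.mean) / self.std
        return self.model.encode_image(img.to(dtype=self.model.dtype))

    def forward(self, fused: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        f_fused = F.normalize(self._encode(fused), dim=-1)
        f_target = F.normalize(self._encode(target), dim=-1)
        # Zero when fused and target share the same CLIP image features.
        return (1.0 - (f_fused * f_target).sum(dim=-1)).mean()


if __name__ == "__main__":
    loss_fn = CLIPFeatureLoss(device="cpu")
    fused, target = torch.rand(2, 1, 128, 128, requires_grad=True), torch.rand(2, 3, 128, 128)
    print(loss_fn(fused, target).item())
```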