With the aid of convolutional neural networks (CNNs), change detection on modern high-resolution remote sensing images has achieved remarkable results. However, the limited receptive field of convolution operations leads to insufficient learning of global context and long-distance spatial relationships. While visual Transformers effectively capture long-range dependencies among features, they handle fine details of image changes poorly, resulting in limited spatial localization capability and low computational efficiency. To address these issues, this paper proposes an end-to-end encoder-decoder hybrid CNN-Transformer change detection model with multi-level cross-layer linear fusion based on dilated spatial pyramid pooling, combining the advantages of visual Transformers and CNNs. First, image features are extracted with a Siamese CNN and refined through dilated spatial pyramid pooling to better capture detailed feature information. Second, the extracted features are converted into compact visual words, which a Transformer encoder models; the learned context-rich tokens are then fed back into visual space through a Transformer decoder to reinforce the original features. Third, the CNN features are fused with the Transformer encoder-decoder features through skip connections, and features of different resolutions are connected via upsampling to combine positional and semantic information. Finally, a difference enhancement module generates difference feature maps containing rich change information. Comprehensive experiments on four publicly available remote sensing datasets, LEVIR, CDD, DSIFN, and WHUCD, confirm the efficacy of the proposed approach. Compared with other state-of-the-art change detection methods, the proposed model achieves superior classification performance, effectively addressing under-segmentation, over-segmentation, and rough edge segmentation in change detection results.
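
The sketch below illustrates, under stated assumptions, how the pipeline described above could be wired together: a shared (Siamese) CNN encoder, dilated spatial pyramid pooling, tokenization into compact visual words processed by a Transformer encoder and projected back by a Transformer decoder, skip-connection fusion, and a simple difference-based change head. All module names, depths, and hyperparameters (e.g. `ASPP`, `TokenTransformer`, `token_len`) are illustrative choices, not the authors' released implementation.

```python
# Minimal sketch of a hybrid CNN-Transformer change detector.
# Assumption: module/parameter names and depths are simplified for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ASPP(nn.Module):
    """Dilated (atrous) spatial pyramid pooling over backbone features."""
    def __init__(self, in_ch, out_ch, rates=(1, 6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r) for r in rates
        )
        self.project = nn.Conv2d(out_ch * len(rates), out_ch, 1)

    def forward(self, x):
        return self.project(torch.cat([b(x) for b in self.branches], dim=1))


class TokenTransformer(nn.Module):
    """Compress a feature map into compact visual tokens, model them with a
    Transformer encoder, and project the context back with a decoder."""
    def __init__(self, dim, token_len=8, heads=4, depth=2):
        super().__init__()
        self.token_attn = nn.Conv2d(dim, token_len, 1)  # one spatial attention map per token
        enc_layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        dec_layer = nn.TransformerDecoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, depth)
        self.decoder = nn.TransformerDecoder(dec_layer, depth)

    def forward(self, feat):                                  # feat: (B, C, H, W)
        b, c, h, w = feat.shape
        attn = self.token_attn(feat).flatten(2).softmax(-1)   # (B, L, HW)
        tokens = attn @ feat.flatten(2).transpose(1, 2)       # (B, L, C) visual words
        tokens = self.encoder(tokens)                         # context-rich tokens
        pixels = feat.flatten(2).transpose(1, 2)              # (B, HW, C)
        refined = self.decoder(pixels, tokens)                # feed context back to pixels
        return refined.transpose(1, 2).reshape(b, c, h, w)


class HybridChangeDetector(nn.Module):
    def __init__(self, dim=64, num_classes=2):
        super().__init__()
        self.backbone = nn.Sequential(                        # shared (Siamese) CNN encoder
            nn.Conv2d(3, dim, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(dim, dim, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        self.aspp = ASPP(dim, dim)
        self.transformer = TokenTransformer(dim)
        self.fuse = nn.Conv2d(2 * dim, dim, 1)                # skip-connection fusion
        self.classifier = nn.Conv2d(dim, num_classes, 1)

    def encode(self, img):
        cnn_feat = self.aspp(self.backbone(img))
        trans_feat = self.transformer(cnn_feat)
        return self.fuse(torch.cat([cnn_feat, trans_feat], dim=1))

    def forward(self, img_t1, img_t2):
        f1, f2 = self.encode(img_t1), self.encode(img_t2)
        diff = torch.abs(f1 - f2)                             # simplified difference enhancement
        diff = F.interpolate(diff, scale_factor=4,
                             mode="bilinear", align_corners=False)
        return self.classifier(diff)                          # per-pixel change logits


if __name__ == "__main__":
    model = HybridChangeDetector()
    t1 = torch.randn(1, 3, 256, 256)
    t2 = torch.randn(1, 3, 256, 256)
    print(model(t1, t2).shape)                                # torch.Size([1, 2, 256, 256])
```

In this sketch the two temporal images pass through the same weights before their fused features are differenced, which mirrors the Siamese design and difference-enhancement idea in the abstract; the actual model described in the paper uses deeper backbones and multi-level fusion across several resolutions.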