Infrared and Visible Image Fusion Method via Interactive Self-attention

Times Cited: 0
Authors
Yang Fan [1 ]
Wang Zhishe [1 ]
Sun Jing [1 ]
Yu Zhaofa [2 ]
Affiliations
[1] Taiyuan Univ Sci & Technol, Sch Appl Sci, Taiyuan 030024, Peoples R China
[2] Army Engn Univ PLA, Ordnance NCO Acad, Wuhan 430075, Peoples R China
Keywords
Image fusion; Self-attention mechanism; Feature interaction; Deep learning; Multi-modality images; NETWORK;
DOI
10.3788/gzxb20245306.0610003
Chinese Library Classification (CLC)
O43 [Optics];
Subject Classification Codes
070207 ; 0803 ;
Abstract
The fusion of infrared and visible images aims to merge their complementary information into a fused output with better visual perception and scene understanding. Existing CNN-based methods typically employ convolutional operations to extract local features but fail to model long-range relationships. Conversely, Transformer-based methods usually adopt a self-attention mechanism to model global dependencies, but lack the supplement of local information. More importantly, these methods often ignore the specialized interactive information learning between different modalities, which limits fusion performance. To address these issues, this paper introduces an infrared and visible image fusion method via interactive self-attention, namely ISAFusion. First, we devise a collaborative learning scheme that seamlessly integrates CNN and Transformer. This scheme leverages residual convolutional blocks to extract local features, which are then aggregated into the Transformer to model global features, thus enhancing its feature representation ability. Second, we construct a cross-modality interactive attention module, a cascade of Token-ViT and Channel-ViT. This module models long-range dependencies along the token and channel dimensions in an interactive manner, allowing feature communication between spatial locations and between independent channels. The generated global features focus markedly on the intrinsic characteristics of the different modality images, which effectively strengthens their complementary information and achieves better fusion performance. Finally, we train the fusion network end-to-end with a comprehensive objective function comprising a structural similarity index measure (SSIM) loss, a gradient loss, and an intensity loss. This design ensures that the fusion model preserves similar structural information, valuable pixel intensity, and rich texture details from the source images. To verify the effectiveness and superiority of the proposed method, we carry out experiments on three different benchmarks, namely the TNO, Roadscene, and M3FD datasets. Seven representative methods, namely U2Fusion, RFN-Nest, FusionGAN, GANMcC, YDTR, SwinFusion, and SwinFuse, are selected for experimental comparison. Eight evaluation metrics, namely average gradient, mutual information, phase congruency, feature mutual information with pixel, edge-based similarity measurement, gradient-based similarity measurement, multi-scale structural similarity index measure, and visual information fidelity, are used for objective evaluation. In the comparative experiments, ISAFusion achieves more balanced fusion results in retaining the typical targets of the infrared image and the rich texture details of the visible image, presenting a better visual effect that is more suitable for the human visual system. From the objective comparison perspective, ISAFusion achieves better fusion performance than the compared methods on all three datasets, which is consistent with the subjective analysis. Furthermore, we evaluate the operational efficiency of the different methods, and the results show that our method is second only to YDTR, indicating competitive computational efficiency. In summary, compared with the seven state-of-the-art competitors, our method presents better image fusion performance, stronger robustness, and higher computational efficiency.
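The cross-modality interactive attention described above, a cascade of token-wise and channel-wise attention, can be illustrated with a minimal PyTorch sketch. This is only a hedged approximation of the idea, not the authors' implementation: the module names, the choice of taking queries from one modality and keys/values from the other, the head count, and the residual connections are assumptions made for illustration.

```python
import torch
import torch.nn as nn


class TokenCrossAttention(nn.Module):
    """Attention over spatial tokens: queries from one modality, keys/values
    from the other, so every location can attend to the whole other image."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.norm_x = nn.LayerNorm(dim)
        self.norm_y = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x, y):                          # x, y: (B, N, C) token sequences
        out, _ = self.attn(self.norm_x(x), self.norm_y(y), self.norm_y(y))
        return x + out                                # residual connection


class ChannelCrossAttention(nn.Module):
    """Attention computed across channels: a C x C affinity matrix lets each
    channel of one modality aggregate information from the other's channels."""
    def __init__(self, dim):
        super().__init__()
        self.norm_x = nn.LayerNorm(dim)
        self.norm_y = nn.LayerNorm(dim)
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)

    def forward(self, x, y):                          # x, y: (B, N, C)
        q = self.q(self.norm_x(x)).transpose(1, 2)    # (B, C, N)
        k = self.k(self.norm_y(y)).transpose(1, 2)    # (B, C, N)
        v = self.v(self.norm_y(y)).transpose(1, 2)    # (B, C, N)
        attn = torch.softmax(q @ k.transpose(1, 2) / q.shape[-1] ** 0.5, dim=-1)
        out = (attn @ v).transpose(1, 2)              # back to (B, N, C)
        return x + out


class InteractiveAttention(nn.Module):
    """Cascade of token-wise then channel-wise cross-modality attention,
    roughly mirroring the Token-ViT -> Channel-ViT ordering in the abstract."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.token_attn = TokenCrossAttention(dim, heads)
        self.channel_attn = ChannelCrossAttention(dim)

    def forward(self, x, y):
        x = self.token_attn(x, y)
        x = self.channel_attn(x, y)
        return x


# Usage sketch: infrared tokens attend to visible tokens; a second, symmetric
# module could let the visible branch attend to the infrared branch.
if __name__ == "__main__":
    ir_tokens = torch.randn(2, 64 * 64, 32)           # (B, N, C) flattened feature maps
    vis_tokens = torch.randn(2, 64 * 64, 32)
    fused_ir = InteractiveAttention(dim=32, heads=4)(ir_tokens, vis_tokens)
    print(fused_ir.shape)                             # torch.Size([2, 4096, 32])
```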
In addition, we carry out ablation experiments to verify the effectiveness of each designed component. The experimental results indicate that removing any of the components degrades the fusion performance to some extent. More specifically, we find that discarding the position embedding has a positive effect on the fusion performance. The qualitative and quantitative ablation studies demonstrate the rationality and superiority of each designed component. In the future, we will explore a more effective CNN-Transformer learning scheme to further improve the fusion performance, and extend it to other fusion tasks such as multi-band, multi-exposure, and multi-focus image fusion.
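For the composite objective mentioned in the abstract (SSIM loss, gradient loss, and intensity loss), a hedged PyTorch sketch of one common way to combine such terms is given below. The uniform-window SSIM, the Sobel gradient operator, the element-wise-maximum targets, and the weights w_ssim/w_grad/w_int are assumptions for illustration; the paper's exact formulation and weights may differ.

```python
import torch
import torch.nn.functional as F


def ssim(x, y, window=11, c1=0.01 ** 2, c2=0.03 ** 2):
    """Uniform-window SSIM for single-channel images in [0, 1] (illustrative)."""
    pad = window // 2
    mu_x = F.avg_pool2d(x, window, 1, pad)
    mu_y = F.avg_pool2d(y, window, 1, pad)
    var_x = F.avg_pool2d(x * x, window, 1, pad) - mu_x ** 2
    var_y = F.avg_pool2d(y * y, window, 1, pad) - mu_y ** 2
    cov_xy = F.avg_pool2d(x * y, window, 1, pad) - mu_x * mu_y
    ssim_map = ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / \
               ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))
    return ssim_map.mean()


def sobel_gradient(x):
    """Absolute Sobel gradient magnitude of a single-channel image batch."""
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]],
                      device=x.device).view(1, 1, 3, 3)
    ky = kx.transpose(2, 3)
    return F.conv2d(x, kx, padding=1).abs() + F.conv2d(x, ky, padding=1).abs()


def fusion_loss(fused, ir, vis, w_ssim=1.0, w_grad=10.0, w_int=10.0):
    """Composite loss: structural similarity + texture gradients + pixel intensity.
    The element-wise-maximum targets and the weights are illustrative assumptions."""
    loss_int = F.l1_loss(fused, torch.max(ir, vis))
    loss_grad = F.l1_loss(sobel_gradient(fused),
                          torch.max(sobel_gradient(ir), sobel_gradient(vis)))
    loss_ssim = (1 - ssim(fused, ir)) + (1 - ssim(fused, vis))
    return w_ssim * loss_ssim + w_grad * loss_grad + w_int * loss_int
```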
Pages: 12
Related Papers
25 records in total
  • [1] Fusion of Infrared and Visible Images Based on Non-subsampled Dual-tree Complex Contourlet and Adaptive Block
    Deng Hui
    Wang Chang-long
    Hu Yong-jiang
    Zhang Yu-hua
    [J]. ACTA PHOTONICA SINICA, 2019, 48 (07)
  • [2] Learning Modality-Specific Representations for Visible-Infrared Person Re-Identification
    Feng, Zhanxiang
    Lai, Jianhuang
    Xie, Xiaohua
    [J]. IEEE TRANSACTIONS ON IMAGE PROCESSING, 2020, 29 : 579 - 590
  • [3] Xu, Han, 2020, Roadscene database
  • [4] Infrared and Low-light-level Visible Light Enhancement Image Fusion Method Based on Latent Low-rank Representation and Composite Filtering
    Jiang Ze-tao
    Jiang Qi
    Huang Yong-song
    Zhang Shao-qin
    [J]. ACTA PHOTONICA SINICA, 2020, 49 (04)
  • [5] Adaptive fusion method of visible light and infrared images based on non-subsampled shearlet transform and fast non-negative matrix factorization
    Kong, Weiwei
    Lei, Yang
    Zhao, Huaixun
    [J]. INFRARED PHYSICS & TECHNOLOGY, 2014, 67 : 161 - 172
  • [6] Li Chenyang, 2020, Infrared Technology, V42, P1042
  • [7] RFN-Nest: An end-to-end residual fusion network for infrared and visible images
    Li, Hui
    Wu, Xiao-Jun
    Kittler, Josef
    [J]. INFORMATION FUSION, 2021, 73 : 72 - 86
  • [8] DenseFuse: A Fusion Approach to Infrared and Visible Images
    Li, Hui
    Wu, Xiao-Jun
    [J]. IEEE TRANSACTIONS ON IMAGE PROCESSING, 2019, 28 (05) : 2614 - 2623
  • [9] Liu, Jinyuan, 2020, M3FD database
  • [10] GANMcC: A Generative Adversarial Network With Multiclassification Constraints for Infrared and Visible Image Fusion
    Ma, Jiayi
    Zhang, Hao
    Shao, Zhenfeng
    Liang, Pengwei
    Xu, Han
    [J]. IEEE TRANSACTIONS ON INSTRUMENTATION AND MEASUREMENT, 2021, 70