Cross on Cross Attention: Deep Fusion Transformer for Image Captioning

Cited by: 12
Authors
Zhang, Jing [1]
Xie, Yingshuai [1]
Ding, Weichao [1]
Wang, Zhe [1]
Affiliations
[1] East China Univ Sci & Technol, Dept Comp Sci & Engn, Shanghai 200237, Peoples R China
Funding
Natural Science Foundation of Shanghai
Keywords
Image captioning; deep fusion transformer; global cross encoder; cross on cross attention; LANGUAGE;
DOI
10.1109/TCSVT.2023.3243725
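Chinese Library Classification (CLC) number
TM [Electrical engineering]; TN [Electronics and communication technology]
Discipline classification code
0808; 0809
Abstract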
Numerous studies have shown that in-depth mining of correlations between multi-modal features can help improve the accuracy of cross-modal data analysis tasks. However, the current image description methods based on the encoder-decoder framework only carry out the interaction and fusion of multi-modal features in the encoding stage or the decoding stage, which cannot effectively alleviate the semantic gap. In this paper, we propose a Deep Fusion Transformer (DFT) for image captioning to provide a deep multi-feature and multi-modal information fusion strategy throughout the encoding to decoding process. We propose a novel global cross encoder to align different types of visual features, which can effectively compensate for the differences between features and incorporate each other's strengths. In the decoder, a novel cross on cross attention is proposed to realize hierarchical cross-modal data analysis, extending complex cross-modal reasoning capabilities through the multi-level interaction of visual and semantic features. Extensive experiments conducted on the MSCOCO dataset prove that our proposed DFT can achieve excellent performance and outperform state-of-the-art methods. The code is available at https://github.com/weimingboya/DFT.
Pages: 4257-4268
Page count: 12
Related papers
50 in total
  • [1] Embedded Heterogeneous Attention Transformer for Cross-Lingual Image Captioning
    Song, Zijie
    Hu, Zhenzhen
    Zhou, Yuanen
    Zhao, Ye
    Hong, Richang
    Wang, Meng
    [J]. IEEE TRANSACTIONS ON MULTIMEDIA, 2024, 26 : 9008 - 9020
  • [2] Spatio-spectral Cross-Attention Transformer for Hyperspectral image and Multispectral image fusion
    Qin, Xilei
    Song, Huihui
    Fan, Jiaqing
    Zhang, Kaihua
    [J]. REMOTE SENSING LETTERS, 2023, 14 (12) : 1303 - 1314
  • [3] Attention-Aligned Transformer for Image Captioning
    Fei, Zhengcong
    [J]. THIRTY-SIXTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FOURTH CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE / THE TWELVETH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2022, : 607 - 615
  • [4] Bridging CNN and Transformer With Cross-Attention Fusion Network for Hyperspectral Image Classification
    Xu, Fulin
    Mei, Shaohui
    Zhang, Ge
    Wang, Nan
    Du, Qian
    [J]. IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, 2024, 62
  • [5] Cross modification attention-based deliberation model for image captioning
    Lian, Zheng
    Zhang, Yanan
    Li, Haichang
    Wang, Rui
    Hu, Xiaohui
    [J]. APPLIED INTELLIGENCE, 2023, 53 (05) : 5910 - 5933
  • [6] Cross modification attention-based deliberation model for image captioning
    Zheng Lian
    Yanan Zhang
    Haichang Li
    Rui Wang
    Xiaohui Hu
    [J]. Applied Intelligence, 2023, 53 : 5910 - 5933
  • [7] Noise-reducing attention cross fusion learning transformer for histological image classification of osteosarcoma
    Pan, Liangrui
    Wang, Hetian
    Wang, Lian
    Ji, Boya
    Liu, Mingting
    Chongcheawchamnan, Mitchai
    Yuan, Jin
    Peng, Shaoliang
    [J]. BIOMEDICAL SIGNAL PROCESSING AND CONTROL, 2022, 77
  • [8] A Cross-Attention-Based Multi-Information Fusion Transformer for Hyperspectral Image Classification
    Yang, Jinghui
    Li, Anqi
    Qian, Jinxi
    Qin, Jia
    Wang, Liguo
    [J]. IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING, 2024, 17 : 13358 - 13375
  • [9] Relational Attention with Textual Enhanced Transformer for Image Captioning
    Song, Lifei
    Shi, Yiwen
    Xiao, Xinyu
    Zhang, Chunxia
    Xiang, Shiming
    [J]. PATTERN RECOGNITION AND COMPUTER VISION, PT III, 2021, 13021 : 151 - 163
  • [10] Stacked cross-modal feature consolidation attention networks for image captioning
    Mozhgan Pourkeshavarz
    Shahabedin Nabavi
    Mohsen Ebrahimi Moghaddam
    Mehrnoush Shamsfard
    [J]. Multimedia Tools and Applications, 2024, 83 : 12209 - 12233