Cross on Cross Attention: Deep Fusion Transformer for Image Captioning

Cited by: 12

Authors:

Zhang, Jing [1]

Xie, Yingshuai [1]

Ding, Weichao [1]

Wang, Zhe [1]

Affiliation:

[1] East China Univ Sci & Technol, Dept Comp Sci & Engn, Shanghai 200237, Peoples R China

Source:

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY | 2023, Vol. 33, No. 8

Funding:

Natural Science Foundation of Shanghai;

Keywords:

Image captioning; deep fusion transformer; global cross encoder; cross on cross attention; LANGUAGE;

DOI:

10.1109/TCSVT.2023.3243725

Chinese Library Classification (CLC) number:

TM [Electrical Engineering]; TN [Electronics and Communication Technology];

Discipline classification codes:

0808 ; 0809 ;

Abstract:

Numerous studies have shown that in-depth mining of correlations between multi-modal features can help improve the accuracy of cross-modal data analysis tasks. However, the current image description methods based on the encoder-decoder framework only carry out the interaction and fusion of multi-modal features in the encoding stage or the decoding stage, which cannot effectively alleviate the semantic gap. In this paper, we propose a Deep Fusion Transformer (DFT) for image captioning to provide a deep multi-feature and multi-modal information fusion strategy throughout the encoding to decoding process. We propose a novel global cross encoder to align different types of visual features, which can effectively compensate for the differences between features and incorporate each other's strengths. In the decoder, a novel cross on cross attention is proposed to realize hierarchical cross-modal data analysis, extending complex cross-modal reasoning capabilities through the multi-level interaction of visual and semantic features. Extensive experiments conducted on the MSCOCO dataset prove that our proposed DFT can achieve excellent performance and outperform state-of-the-art methods. The code is available at https://github.com/weimingboya/DFT.
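For readers unfamiliar with the building block the abstract refers to, the following is a minimal sketch of generic single-head scaled dot-product cross-attention, in which word (semantic) features query visual features. It is not the authors' cross on cross attention or global cross encoder, whose details are not given in this record. The function names, feature sizes, and random inputs are assumptions made purely for illustration, and learned projection matrices are omitted.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, context):
    """Generic scaled dot-product cross-attention (illustrative only).

    queries: list of d-dimensional vectors (e.g. word features).
    context: list of d-dimensional vectors (e.g. visual features).
    Returns the attended vectors and the attention weights.
    """
    d = len(queries[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this query to every context vector, scaled by sqrt(d).
        scores = [sum(qi * ci for qi, ci in zip(q, c)) / math.sqrt(d)
                  for c in context]
        w = softmax(scores)
        # Weighted sum of the context vectors.
        out = [sum(wj * c[k] for wj, c in zip(w, context)) for k in range(d)]
        outputs.append(out)
        all_weights.append(w)
    return outputs, all_weights


# Hypothetical toy inputs: 5 word features attending over 10 visual features.
random.seed(0)
dim = 8
words = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(5)]
regions = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(10)]
attended, weights = cross_attention(words, regions)
```

Each row of `weights` is a probability distribution over the visual features, so every attended word vector is a convex combination of them. A full model would add learned query, key, and value projections and multiple heads.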


Pages: 4257-4268

Page count: 12

