Double-Stream Position Learning Transformer Network for Image Captioning

Cited by: 21

Authors:

Jiang, Weitao [1]

Zhou, Wei [1]

Hu, Haifeng [1]

Affiliation:

[1] Sun Yat Sen Univ, Sch Elect & Informat Technol, Guangzhou 510006, Peoples R China

Source:

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY | 2022 / Vol. 32 / No. 11

Keywords:

Transformers; Feature extraction; Visualization; Decoding; Convolutional neural networks; Task analysis; Semantics; Image captioning; transformer; convolutional position learning; attention mechanism;

DOI:

10.1109/TCSVT.2022.3181490

Chinese Library Classification (CLC) number:

TM [Electrical engineering]; TN [Electronic technology, communication technology];

Discipline classification code:

0808 ; 0809 ;

Abstract:

Image captioning has advanced significantly through improvements in feature extractors and model architectures. Recently, image region features extracted by object detectors have come to prevail in most existing models. However, region features are criticized for lacking background and full contextual information. This problem can be remedied by providing complementary visual information from patch features. In this paper, we propose a Double-Stream Position Learning Transformer Network (DSPLTN), which exploits the advantages of both region features and patch features. Specifically, the region-stream encoder uses a Transformer encoder with a Relative Position Learning (RPL) module to enhance the representations of region features by modeling the relationships between regions and positions, respectively. For the patch-stream encoder, we introduce a convolutional neural network into the vanilla Transformer encoder and propose a novel Convolutional Position Learning (CPL) module to encode the position relationships between patches. CPL improves relationship modeling by combining the position and visual content of patches. Incorporating CPL into the Transformer encoder combines the benefits of convolution in local relation modeling and self-attention in global feature fusion, thereby compensating for the information loss caused by flattening 2D feature maps into 1D patches. Furthermore, an Adaptive Fusion Attention (AFA) mechanism is proposed to balance the contributions of the enhanced region and patch features. Extensive experiments on MSCOCO demonstrate the effectiveness of the double-stream encoder and CPL, and show the superior performance of DSPLTN.
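
The abstract says that the two feature streams are balanced by an attention-based fusion step, but it does not give the formulation. As a rough illustration of the general idea only, the sketch below attends over region features and patch features separately and then mixes the two contexts with a sigmoid gate. The function names (`attend`, `gated_fusion`), the gate parameterization, and the toy inputs are all assumptions made for this example and are not taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, feats):
    """Scaled dot-product attention of one query vector over a list of feature vectors."""
    scale = math.sqrt(len(query))
    scores = [sum(q * f for q, f in zip(query, feat)) / scale for feat in feats]
    weights = softmax(scores)
    return [sum(w * feat[i] for w, feat in zip(weights, feats))
            for i in range(len(query))]


def gated_fusion(query, region_feats, patch_feats, gate_weights):
    """Hypothetical fusion (not the paper's AFA): a per-dimension sigmoid gate
    mixes the region context and the patch context.

    gate_weights has one row per output dimension; each row has 3 * d entries
    and is applied to the concatenation [query, region context, patch context].
    """
    ctx_r = attend(query, region_feats)
    ctx_p = attend(query, patch_feats)
    gate_in = query + ctx_r + ctx_p  # list concatenation, length 3 * d
    fused = []
    for i, row in enumerate(gate_weights):
        g = 1.0 / (1.0 + math.exp(-sum(w * x for w, x in zip(row, gate_in))))
        fused.append(g * ctx_r[i] + (1.0 - g) * ctx_p[i])
    return fused


# Toy inputs: feature dimension d = 2, two "regions", three "patches".
query = [1.0, 0.0]
region_feats = [[1.0, 0.0], [0.0, 1.0]]
patch_feats = [[0.5, 0.5], [0.2, 0.8], [0.9, 0.1]]
gate_weights = [[0.0] * 6, [0.0] * 6]  # all-zero weights give a gate of 0.5
print(gated_fusion(query, region_feats, patch_feats, gate_weights))
```

With all-zero gate weights the gate equals 0.5 in every dimension, so the output is the plain average of the two contexts; learned weights would let the mix shift toward either stream depending on the query.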


Pages: 7706-7718

Number of pages: 13

