Transformer-based local-global guidance for image captioning

Cited by: 11
Authors
Parvin, Hashem [1 ]
Naghsh-Nilchi, Ahmad Reza [1 ]
Mohammadi, Hossein Mahvash [1 ]
Affiliations
[1] Univ Isfahan, Fac Comp Engn, Dept Artificial Intelligence, Esfahan, Iran
Keywords
Attention; Transformer; Encoder-decoder; Image captioning; Deep learning
DOI
10.1016/j.eswa.2023.119774
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Image captioning, which compresses the visual content of images into descriptive language, is a difficult problem for machine learning algorithms. Recurrent models are widely used as the decoder and achieve significant performance, but they are complicated and inherently sequential over time. Transformers, by contrast, model long-range dependencies and support parallel processing of sequences. However, recent transformer-based models assign attention weights to all candidate vectors under the assumption that every vector is relevant, and they ignore intra-object relationships. Moreover, a single attention mechanism cannot capture the complex relationships between key and query vectors. In this paper, a new transformer-based image captioning architecture without recurrence or convolution is proposed to address these issues. To this end, a generator network and a selector network are designed to generate textual descriptions collaboratively. The work comprises three main steps: (1) design a transformer-based generator network as word-level guidance to generate the next word based on the current state; (2) train a latent space that maps captions and images into the same embedding space to learn the text-image relation; (3) design a selector network as sentence-level guidance to evaluate candidate next words by assigning fitness scores to partial captions through the embedding space. Compared with existing architectures, the proposed approach contains an attention mechanism without time dependencies: at each state it selects the next best word using local-global guidance. In addition, the proposed model preserves dependencies between sequence elements and can be trained in parallel. Several experiments on the COCO and Flickr datasets demonstrate that the proposed approach outperforms various state-of-the-art models on well-known evaluation measures.
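The three-step decoding loop described in the abstract can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the random "generator", the mean-pooled text embedding, the cosine-similarity "selector", the tiny vocabulary, and the mixing weight `alpha` are all assumptions standing in for the trained networks.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = ["a", "dog", "runs", "on", "grass", "<eos>"]
DIM = 8

# Hypothetical stand-in for the shared text-image embedding space (step 2).
word_emb = {w: rng.normal(size=DIM) for w in VOCAB}

def embed_text(words):
    """Map a partial caption into the shared embedding space (step 2)."""
    if not words:
        return np.zeros(DIM)
    return np.mean([word_emb[w] for w in words], axis=0)

def generator_logits(partial):
    """Word-level guidance (step 1): score every candidate next word.
    A real model would be a transformer decoder; here it is random."""
    return rng.normal(size=len(VOCAB))

def selector_fitness(partial, image_vec):
    """Sentence-level guidance (step 3): fitness of a partial caption,
    here the cosine similarity of its embedding to the image embedding."""
    t = embed_text(partial)
    denom = np.linalg.norm(t) * np.linalg.norm(image_vec)
    return 0.0 if denom == 0 else float(t @ image_vec / denom)

def decode(image_vec, max_len=5, alpha=0.5):
    """Greedy decoding that mixes local (generator) and global (selector)
    guidance when choosing each next word."""
    caption = []
    for _ in range(max_len):
        local = generator_logits(caption)
        scores = [
            (1 - alpha) * local[i]
            + alpha * selector_fitness(caption + [w], image_vec)
            for i, w in enumerate(VOCAB)
        ]
        best = VOCAB[int(np.argmax(scores))]
        if best == "<eos>":
            break
        caption.append(best)
    return caption

image_vec = rng.normal(size=DIM)  # stand-in for an encoded image
print(decode(image_vec))
```

The point of the sketch is the scoring line inside `decode`: the generator's local next-word score and the selector's global caption-image fitness are combined before each word is committed, which is the local-global guidance the paper describes.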
Pages: 20