Deep Learning Approaches Based on Transformer Architectures for Image Captioning Tasks

Cited by: 15
Authors
Castro, Roberto [1 ]
Pineda, Israel [1 ]
Lim, Wansu [2 ]
Morocho-Cayamcela, Manuel Eugenio [1 ]
Affiliations
[1] Yachay Tech Univ, Sch Math & Computat Sci, Deep Learning Autonomous Driving Robot & Comp Vis, Urcuqui 100119, Ecuador
[2] Kumoh Natl Inst Technol, Dept Aeronaut Mech & Elect Convergence Engn, Future Commun & Syst Lab FCSL, Gumi Si 39177, Gyeongbuk, South Korea
Funding
National Research Foundation of Singapore
Keywords
Transformers; Computer architecture; Task analysis; Decoding; Measurement; Computational modeling; Computer vision; Image captioning; visual attention; computer vision; supervised learning; artificial intelligence;
DOI
10.1109/ACCESS.2022.3161428
CLC number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
This paper focuses on visual attention, a state-of-the-art approach for image captioning tasks within the computer vision research area. We study the impact that different hyperparameter configurations have on the efficiency of an encoder-decoder visual attention architecture. Results show that the correct selection of both the cost function and the gradient-based optimizer can significantly impact the captioning results. Our system considers the cross-entropy, Kullback-Leibler divergence, mean squared error, and negative log-likelihood loss functions, and the adaptive moment estimation (Adam), AdamW, RMSprop, stochastic gradient descent, and Adadelta optimizers. Experimentation shows that the combination of cross-entropy with Adam is the best alternative, returning a Top-5 accuracy of 73.092 and a BLEU-4 score of 20.10. Furthermore, a comparative analysis of alternative convolutional architectures evaluated their performance as encoders. Our results show that ResNext-101 stands out with a Top-5 accuracy of 73.128 and a BLEU-4 of 19.80, positioning itself as the best option when optimum captioning quality is the priority. However, MobileNetV3 proved to be a much more compact alternative, with 2,971,952 parameters and 0.23 giga multiply-accumulate operations (GMACs). Consequently, MobileNetV3 offers competitive output quality at a much lower computational cost, supported by values of 19.50 and 72.928 for BLEU-4 and Top-5 accuracy, respectively. Finally, when testing vision transformer (ViT) and data-efficient image transformer (DeiT) models as replacements for the convolutional component of the architecture, DeiT achieved an improvement over ViT, obtaining a value of 34.44 on the BLEU-4 metric.
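To make the experimental setup concrete, the following is a minimal PyTorch sketch of the kind of loss/optimizer sweep and encoder swap the abstract describes. It is an illustration under assumptions, not the authors' code: the helper names (build_encoder, make-up OPTIMIZERS table), the toy linear decoder head, and the random data are hypothetical, and off-the-shelf torchvision backbones stand in for the paper's encoders.

```python
# Minimal sketch (not the paper's code) of sweeping loss functions and
# gradient-based optimizers over interchangeable convolutional encoders.
# Helper names, the toy decoder head, and the random data are hypothetical.
import torch
import torch.nn as nn
from torchvision import models


def build_encoder(name: str) -> nn.Module:
    """Return a backbone with its classification head removed (hypothetical helper)."""
    if name == "resnext101":      # best captioning quality in the paper's comparison
        m = models.resnext101_32x8d(weights=None)
        return nn.Sequential(*list(m.children())[:-1])  # keep conv stages + avgpool
    if name == "mobilenet_v3":    # compact alternative (~3M parameters)
        m = models.mobilenet_v3_large(weights=None)
        return nn.Sequential(m.features, nn.AdaptiveAvgPool2d(1))  # 960-d features
    raise ValueError(f"unknown encoder: {name}")


# The abstract's search space: four loss functions and five optimizers.
LOSSES = {
    "cross_entropy": nn.CrossEntropyLoss(),
    "kl_div": nn.KLDivLoss(reduction="batchmean"),
    "mse": nn.MSELoss(),
    "nll": nn.NLLLoss(),
}
OPTIMIZERS = {
    "adam": torch.optim.Adam,
    "adamw": torch.optim.AdamW,
    "rmsprop": torch.optim.RMSprop,
    "sgd": torch.optim.SGD,
    "adadelta": torch.optim.Adadelta,
}

vocab_size = 10_000
encoder = build_encoder("resnext101")
head = nn.Linear(2048, vocab_size)  # toy stand-in for the attention decoder;
                                    # feature width differs per encoder (2048 vs 960)

# Cross-entropy + Adam: the combination the paper found best.
criterion = LOSSES["cross_entropy"]
optimizer = OPTIMIZERS["adam"](
    list(encoder.parameters()) + list(head.parameters()), lr=1e-4
)

images = torch.randn(2, 3, 224, 224)          # fake image batch
targets = torch.randint(0, vocab_size, (2,))  # fake "next caption token" labels

logits = head(encoder(images).flatten(1))     # (batch, vocab_size)
loss = criterion(logits, targets)
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(f"one training step done, loss = {loss.item():.3f}")
```

A ViT or DeiT backbone could be dropped in the same way (for example via the timm library), provided the decoder's input dimension is adjusted to the transformer's feature size.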
Pages: 33679 - 33694 (16 pages)