Input enhanced asymmetric transformer for image captioning

Cited by: 2
Authors
Zhu, Chenhao [1 ]
Ye, Xia [1 ]
Lu, Qiduo [1 ]
Affiliation
[1] Xian Res Inst High Tech, Xian 710025, Peoples R China
Keywords
Image caption; Adaptive sparse attention; Vision-Language pretraining
DOI
10.1007/s11760-022-02350-9
CLC (Chinese Library Classification) number
TM [Electrical Engineering]; TN [Electronic Technology, Communication Technology];
Subject classification code
0808; 0809
Abstract
Image captioning is a popular research direction in computer vision: the task of enabling a machine to convey its visual perception and cognition to the outside world in human language. Currently, the dominant models are Transformer-based architectures, which achieve cutting-edge performance. Inspired by the distinguished meshed-memory transformer model, which uses mesh-like connectivity at the decoding stage and reveals further possibilities in the Transformer architecture, we propose the input enhanced asymmetric transformer (IEAT) model to explore more possible connectivity schemas within the Transformer. It improves the connectivity between encoder layers and improves the quality of the generated captions. To evaluate our model thoroughly, we conducted extensive experiments (offline evaluation, online evaluation, and an ablation study) on the MS-COCO benchmark with the "Karpathy" test split. The results show that IEAT outperforms previously proposed models and generates satisfactory image captions.
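The "mesh-like" connectivity the abstract refers to (from the meshed-memory transformer) can be illustrated with a toy sketch: instead of the decoder cross-attending only to the last encoder layer, it aggregates cross-attention results from all encoder layers through per-layer gates. The shapes, single-head attention, and fixed gates below are illustrative assumptions, not the IEAT paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values):
    # Toy single-head scaled dot-product attention
    # (keys and values share the same matrix here).
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)
    return softmax(scores) @ keys_values

def meshed_cross_attention(queries, encoder_layer_outputs, gates):
    # Gated sum of cross-attention over EVERY encoder layer's output,
    # rather than attending to the final layer alone.
    out = np.zeros_like(queries)
    for layer_out, gate in zip(encoder_layer_outputs, gates):
        out += gate * cross_attention(queries, layer_out)
    return out

rng = np.random.default_rng(0)
q = rng.standard_normal((4, 8))                         # 4 decoder positions, d=8
enc = [rng.standard_normal((10, 8)) for _ in range(3)]  # outputs of 3 encoder layers
gates = softmax(np.array([0.2, 0.5, 0.3]))              # learned per layer in practice
print(meshed_cross_attention(q, enc, gates).shape)      # (4, 8)
```

In the real model the gates are learned functions of the queries and each layer's attention output; the sketch fixes them to constants only to keep the connectivity pattern visible.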
Pages: 1419-1427
Page count: 9
Related papers
50 in total
  • [41] Efficient Image Captioning Based on Vision Transformer Models
    Elbedwehy, Samar
    Medhat, T.
    Hamza, Taher
    Alrahmawy, Mohammed F.
    CMC-COMPUTERS MATERIALS & CONTINUA, 2022, 73 (01): : 1483 - 1500
  • [42] External knowledge-assisted Transformer for image captioning
    Li, Zhixin
    Su, Qiang
    Chen, Tianyu
    IMAGE AND VISION COMPUTING, 2023, 140
  • [43] Dual-Spatial Normalized Transformer for image captioning
    Hu, Juntao
    Yang, You
    An, Yongzhi
    Yao, Lu
    ENGINEERING APPLICATIONS OF ARTIFICIAL INTELLIGENCE, 2023, 123
  • [44] Caption TLSTMs: combining transformer with LSTMs for image captioning
    Yan, Jie
    Xie, Yuxiang
    Luan, Xidao
    Guo, Yanming
    Gong, Quanzhi
    Feng, Suru
    INTERNATIONAL JOURNAL OF MULTIMEDIA INFORMATION RETRIEVAL, 2022, 11 (02) : 111 - 121
  • [45] Reinforcement Learning Transformer for Image Captioning Generation Model
    Yan, Zhaojie
    FIFTEENTH INTERNATIONAL CONFERENCE ON MACHINE VISION, ICMV 2022, 2023, 12701
  • [46] Improving Stylized Image Captioning with Better Use of Transformer
    Tan, Yutong
    Lin, Zheng
    Liu, Huan
    Zuo, Fan
    ARTIFICIAL NEURAL NETWORKS AND MACHINE LEARNING - ICANN 2022, PT III, 2022, 13531 : 347 - 358
  • [47] Graph Alignment Transformer for More Grounded Image Captioning
    Tian, Canwei
    Hu, Haiyang
    Li, Zhongjin
    2022 INTERNATIONAL CONFERENCE ON INDUSTRIAL IOT, BIG DATA AND SUPPLY CHAIN, IIOTBDSC, 2022, : 95 - 102
  • [48] Visual enhanced gLSTM for image captioning
    Zhang, Jing
    Li, Kangkang
    Wang, Zhenkun
    Zhao, Xianwen
    Wang, Zhe
    EXPERT SYSTEMS WITH APPLICATIONS, 2021, 184
  • [49] Visual contextual relationship augmented transformer for image captioning
    Su, Qiang
    Hu, Junbo
    Li, Zhixin
    APPLIED INTELLIGENCE, 2024, 54 (06) : 4794 - 4813
  • [50] Spiking-Transformer Optimization on FPGA for Image Classification and Captioning
    Udeji, Uchechukwu Leo
    Margala, Martin
    SOUTHEASTCON 2024, 2024, : 1353 - 1357